PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full
Other
6 stars 7 forks source link

[RT 132844] PDF::API2 corrupts certain PNMs when embedding them #121

Closed PhilterPaper closed 3 years ago

PhilterPaper commented 4 years ago

Fri Jun 19 07:22:06 2020 jffry@posteo.net - Ticket created [Reply] [Forward] Subject: PDF::API2 corrupts certain PNMs when embedding them

#!/usr/bin/perl -w
use strict;
use PDF::API2;

my $pdf = PDF::API2-> new(-file => 'test.pdf');

my $page = $pdf-> page;
$page->mediabox(2480/300*72*5, 3506/300*72*5);

my $gfx = $page->gfx;
my $img = $pdf->image_pnm(shift);
$gfx->image($img);

$pdf->save;
$pdf->end();

Run the above script with the pnm filename as an argument, the right hand side of the resulting PDF is shifted to the left.

Investigating this, the corruption depends upon the geometry of the image - it is possible to crop the image to a size that is not corrupted in the PDF.

e.g. the pnm after this cropping option is not corrupted:

convert corrupt_no_compression.pnm -crop 790x400+500+900 corrupt_no_compression_cropped.pnm

but cropping only the width, the corruption is still there, but less:

convert corrupt_no_compression.pnm -crop 790x3506+500+0 corrupt_no_compression_cropped.pnm

equally, cropping only the height, the corruption is also still there, in a similar magnitude to the original:

convert corrupt_no_compression.pnm -crop 2480x400+0+900 corrupt_no_compression_cropped.pnm

corrupt_no_compression.pnm.txt

PhilterPaper commented 4 years ago

Reply #1: Today at 11:52:23 AM by Phil

Fri Jun 19 11:39:13 2020 PMPERRY@cpan.org - Correspondence added

I can see a problem with "insufficient data" for the image when the PDF tries to display it. This is both PDF::API2 and PDF::Builder. I am looking at it.

There are a couple odd things in your Perl script: 1) you use "shift" for the name fed to imagepnm, even though @ is empty at this point. I can avoid that error by replacing "shift" with "$ARGV[0]". Can you explain what you are doing there? 2) normally you give the lower left corner x,y in the image() call. There's no telling for sure what it's using for x and y coordinates if you omit them. However, no improvement even if I add ",0,0" to the call. I also tried adding the dimensions ",2480,3506" but that didn't help.

The PNM file itself appears to be OK -- I can display it as 2480x3506 in GIMP (grayscale 8 bit/pixel). Those are the dimensions stored in the PDF object describing the image, although the compressed stream (Flate compress) is 1047348 bytes rather than the original 8694880. Possibly something is going wrong with the compression itself -- I will have to try adding the image as UNcompressed data and see what happens. The page (media box) appears to be sufficiently large to hold the image.

Let me know if you make any headway with this, as I will be looking at it from this end.

PhilterPaper commented 4 years ago

Fri Jun 19 13:53:52 2020 jffry@posteo.net - Correspondence added

On 19/06/2020 17:39, Phil M. Perry via RT wrote:

There are a couple odd things in your Perl script: 1) you use "shift" for the name fed to imagepnm, even though @ is empty at this point. I can avoid that error by replacing "shift" with "$ARGV[0]". Can you explain what you are doing there? 2) normally you give the lower left corner x,y in the image() call. There's no telling for sure what it's using for x and y coordinates if you omit them. However, no improvement even if I add ",0,0" to the call. I also tried adding the dimensions ",2480,3506" but that didn't help.

"shift" used like that is the same as $ARGV[0], as it is a shortcut for shift @ARGV.

The PNM file itself appears to be OK -- I can display it as 2480x3506 in GIMP (grayscale 8 bit/pixel). Those are the dimensions stored in the PDF object describing the image, although the compressed stream (Flate compress) is 1047348 bytes rather than the original 8694880. Possibly something is going wrong with the compression itself -- I will have to try adding the image as UNcompressed data and see what happens. The page (media box) appears to be sufficiently large to hold the image.

Let me know if you make any headway with this, as I will be looking at it from this end.

I tried out various different filters with no success.

The header of the pnm says 2480x3506, which implies a filesize of 8694919, which is what the file is (on my system).

I would expect the flate to compress it quite well, as there quite a bit of white there. So 8x smaller is no surprise.

PhilterPaper commented 4 years ago

Fri Jun 19 22:38:00 2020 PMPERRY@cpan.org - Correspondence added

I think I've got it fixed (in PDF::Builder, PhilterPaper/Perl-PDF-Builder). It's two lines to add in PNM.pm (see GitHub entry for RT 132844 fix), and should be easy enough to add to PDF::API2. Let me know if the fix clears things up for you. This may be a code path that was never exercised for PNM before. The error I was getting was that there was insufficient data for the image... I'm not sure what you were seeing.

PhilterPaper commented 4 years ago

Sat Jun 20 06:11:37 2020 jffry@posteo.net - Correspondence added

On 20/06/2020 04:38, Phil M. Perry via RT wrote:

<URL: https://rt.cpan.org/Ticket/Display.html?id=132844 >

I think I've got it fixed (in PDF::Builder, PhilterPaper/Perl-PDF-Builder). It's two lines to add in PNM.pm (see GitHub entry for RT 132844 fix), and should be easy enough to add to PDF::API2. Let me know if the fix clears things up for you. This may be a code path that was never exercised for PNM before. The error I was getting was that there was insufficient data for the image... I'm not sure what you were seeing.

Thanks for the patch, Phil. I confirm that this fixes things for me too.

For reference, I've attached the patch adapted for PDF::API2.

PhilterPaper commented 4 years ago

Sat Jun 20 09:29:51 2020 PMPERRY@cpan.org - Correspondence added

Great! The patch is in PDF::Builder; I don't know when Steve will put it into PDF::API2. Note that I have not looked at any other PNM types/subtypes, so something similar could happen with other PNM files, too.

BTW, the defaulting of x and y to 0,0 in image() and form_image() wasn't really part of the PNM fix, but it doesn't hurt to put it in. Being undefined in your example code, they apparently were being treated as 0,0, but I don't like relying on such behavior. If someone used, e.g., the -e flag on Perl, that might be flagged as an error.

PhilterPaper commented 4 years ago

Closing

PhilterPaper commented 4 years ago

Fri Jun 26 11:54:20 2020 futuramedium@yandex.ru - Correspondence added

Hi, the specification

http://netpbm.sourceforge.net/doc/pgm.html

mentions (see #8) "single whitespace character (usually a newline)" between Maxval and raster data; that whitespace character was already consumed by "readppmheader"; the purpose of this line:

https://metacpan.org/source/SSIMMS/PDF-API2-2.037/lib/PDF/API2/Resource/XObject/Image/PNM.pm#L134

eludes me completely; was reading a PGM ever successful with PDF::API2, and was it even tested? No similar line for PPM full-color images, as I see.

Patch, as suggested, fixes the issue, but, in absence of what that line was supposed to do (such as well-known and tolerated application which produced broken PGMs, perhaps? but then what about others?), I'd simply remove it. Keep this remark documented, at least; the module has its share of issues, if whoever is going to untangle them, maybe this will alleviate the task

PhilterPaper commented 4 years ago

Fri Jun 26 16:33:49 2020 PMPERRY@cpan.org - Correspondence added

Yeah, I'm not sure just what it's supposed to be doing. What I did find was that this one read sucked in 857kB of data (well into the raster data), leaving the next read way short. The patch I gave works for THIS particular PNM file (the ONLY one I have to test with), but it shouldn't be surprising if it breaks on another file. That's not even considering other types/subtypes. Maybe it was expecting an empty or dummy line to simply eat, leaving the next read at the start of the raster data?

As PNM is apparently pretty rare (I had never seen one before this ticket), I'm not inclined to put a lot of effort into fixing it at this point. If you or someone else wants to go over it with a fine-toothed comb and fix up the code, I'll be happy to incorporate it into PDF::Builder. Otherwise, if it proves to be really buggy, I don't think it would be a tragedy to remove PNM support entirely.

PhilterPaper commented 4 years ago

Sat Jun 27 05:38:08 2020 jffry@posteo.net - Correspondence added

It's a really simple format, no compression and shouldn't be hard to fix properly:

https://en.wikipedia.org/wiki/Netpbm

You could cover the basics with six test images. I could create them for you, if you like.

I first noticed that all was not good with PDF::API2's handling of PNM ages ago. Here is another image I found in 2009:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514380#5

PhilterPaper commented 4 years ago

Sat Jun 27 15:58:26 2020 PMPERRY@cpan.org - Correspondence added

Well.... if it's a format that people are actually still using, and depend on, and the format is that simple, it might be worth fixing. After all, I did go and fix the BDF font support. It sounds like there are more than just six variations that would need to be handled, and tested, but if six would take care of the vast majority of users, we could at least start with that.

How about all the related formats similar to .pnm? Are you looking at those too, or just PNM itself? I just don't want to go down a rabbit hole with "please add this additional format/variation", or there are all sorts of informal variations, and it never ends. Also please pay attention to all the formats/variations that PDF::API2 currently claims to support -- would any of those be dropped?

If you're willing to contribute a list of vital formats/variations to support, and test cases for them, I can work with you on PDF::Builder's PNM support. I can't do anything about PDF::API2, although the code for Builder will probably work OK on API2 (no guarantees).

If you have a GitHub account, please open a ticket on PDF::Builder (PhilterPaper/Perl-PDF-Builder), or open an account on catskilltech.com/forum. I don't want to clog up PDF::API2 discussions with PDF::Builder support.

PhilterPaper commented 4 years ago

I'll reopen this issue simply to keep it in mind, that there is at least one request to fully support PNM and related image formats. Maybe some time when things are slow...

PhilterPaper commented 3 years ago

Thanks to a slew of test cases from Hank Ivy (@hankivy) and Jeffrey Ratcliffe, I was able to rewrite PNM.pm to properly handle a wide range of format variations (especially comments), as well as plain/ASCII modes (P1..P3) and 16 bit samples for gray/RGB (fingers crossed, test cases are few and far between). Needless to say, the 2 or 3 bugs recently found have been swatted. The only thing missing at this point are multiple images per file, but I haven't seen a clear definition of the required format, so I'm holding off for now.

No, things weren't that slow, but this was one of those things that get under my skin and I have to complete the job. At Jeff's suggestion, I first looked at a couple CPAN libraries for PNM support, but they didn't appear to handle all the format variations I saw in the test cases. Therefore, I needed to write my own.

Closing.