PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

These G4 and LZW-encoded TIFFs are negated when embedded in PDF #141

Closed carygravel closed 3 years ago

carygravel commented 3 years ago

tiff.zip

carygravel commented 3 years ago

This website provides a detailed preflight, but finds many more problems than just the image:

    1.2.1: Body Syntax error, Single space expected [offset=335; key=335; line=4 0 obj << /Producer (PDF::Builder) >> endobj; object=COSObject{4, 0}]
    1.2.1: Body Syntax error, EOL expected before the 'endobj' keyword at offset 374
    1.2.1: Body Syntax error, Single space expected [offset=15; key=15; line=1 0 obj << /Type /Catalog /PageLayout /SinglePage /PageMode /UseNone /Pages 2 0 R /ViewerPreferences << /NonFullScreenPageMode /UseNone >> >> endobj; object=COSObject{1, 0}]
    1.2.1: Body Syntax error, EOL expected before the 'endobj' keyword at offset 157
    1.2.1: Body Syntax error, Single space expected [offset=164; key=164; line=2 0 obj << /Type /Pages /Count 1 /Kids [ 5 0 R ] /MediaBox [ 0 0 612 792 ] /Resources 3 0 R >> endobj; object=COSObject{2, 0}]
    1.2.1: Body Syntax error, EOL expected before the 'endobj' keyword at offset 259
    1.2.1: Body Syntax error, Single space expected [offset=266; key=266; line=3 0 obj << /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] >> endobj; object=COSObject{3, 0}]
    1.2.1: Body Syntax error, EOL expected before the 'endobj' keyword at offset 328
    1.2.1: Body Syntax error, Single space expected [offset=381; key=381; line=5 0 obj << /Type /Page /Contents [ 6 0 R ] /MediaBox [ 0 0 174888 247824 ] /Parent 2 0 R /Resources << /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject << /IxCBA 7 0 R >> >> >> endobj; object=COSObject{5, 0}]
    1.2.1: Body Syntax error, EOL expected before the 'endobj' keyword at offset 566
    1.2.1: Body Syntax error, Single space expected [offset=573; key=573; line=6 0 obj << /Filter [ /FlateDecode ] /Length 43 >> stream; object=COSObject{6, 0}]
    1.2.1: Body Syntax error, EOL expected before the 'endobj' keyword at offset 684
    1.2.1: Body Syntax error, Single space expected [offset=691; key=691; line=7 0 obj << /Type /XObject /Subtype /Image /BitsPerComponent 1 /ColorSpace /DeviceGray /DecodeParms [ << /BlackIs1 false /Columns 2429 /DamagedRowsBeforeError 100 /K -1 /Rows 3442 >> ] /Filter [ /CCITTFaxDecode ] /Height 3442 /Interpolate true /Length 53727 /Name /IxCBA /Width 2429 /usesGT 1 >> stream; object=COSObject{7, 0}]
    1.2.1: Body Syntax error, EOL expected before the 'endobj' keyword at offset 54731
    1.4.1: Trailer Syntax error, The trailer dictionary doesn't contain ID
    2: Unknown graphics error, TIFFFaxDecoder: Invalid code encountered while decoding 2D group 4 compressed data. for entry 'IxCBA'
    7.1: Error on MetaData, Missing Metadata Key in catalog

carygravel commented 3 years ago

veraPDF is cross-platform, but I think it only supports PDF/A.

PhilterPaper commented 3 years ago

I can't figure out what all these "single space expected" errors are -- I don't see any missing spaces or double spaces. Maybe once "endobj" is put on its own line, those errors will go away. That should be a trivial fix.
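One way to see which objects would trip the "EOL expected before the 'endobj' keyword" check is to scan the file for `endobj` keywords that are not preceded by a line ending. This is only a rough sketch: the function name is made up for illustration, and a plain byte search could in principle also hit the string `endobj` inside binary stream data.

```python
import re

def endobj_without_eol(pdf_bytes):
    """Return byte offsets of 'endobj' keywords not directly preceded by CR or LF."""
    offsets = []
    for match in re.finditer(rb"endobj", pdf_bytes):
        start = match.start()
        # The validator wants an end-of-line marker immediately before the keyword.
        if start == 0 or pdf_bytes[start - 1:start] not in (b"\r", b"\n"):
            offsets.append(start)
    return offsets

# First object mimics the single-line layout quoted by the validator;
# second object has 'endobj' on its own line.
sample = (b"4 0 obj << /Producer (PDF::Builder) >> endobj\n"
          b"5 0 obj\n<< >>\nendobj\n")
flagged = endobj_without_eol(sample)
print(flagged)
```

Only the first object in the sample is flagged, which matches the layout the validator complained about.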

The other "errors" sound like they come from the file being validated as PDF/A rather than as a normal PDF. I'll see if there's anything else that can easily be made to go away. I always find validators to be a royal PITA (PDF validators, HTML validators, GitHub's CI Linter, Perl Critic, etc.) -- they're not smart enough to see what's going on in context, and so generate loads of spurious errors.

But it reports the same errors for 3.019:

That suggests that even at 3.019, before the TIFF-alpha fixes, something wasn't quite right, and I'm wasting my time trying to find the difference between 3.019 and 3.021.

I've tried the pdf-online.com tool before, but found it didn't give any useful information (and it also validates against PDF/A). It's interesting that both validators flag the length of the raster data as being 8 bytes short -- maybe that's what Adobe is complaining about (but why don't evince, XpdfReader, and Firefox flag this?). Anyway, I haven't found any place where Builder changes the length of the raster data, so it might be libtiff's fault. It's also not clear how the validators calculated the actual length to compare against the stated /Length. Looks like I'm going to be spending some time counting bytes again to see if the raster data is long or short. I wonder if there's a problem with line endings (CRLF vs. NL) here?
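The byte counting could be automated along these lines. This is a sketch under assumptions, not how either validator works: it handles only direct `/Length` values, assumes the literal `endstream` does not occur inside the binary data, and treats one trailing line ending before `endstream` as not belonging to the data. The function name is hypothetical.

```python
import re

def check_stream_lengths(pdf_bytes):
    """Return (declared, actual) byte counts for each stream with a direct /Length."""
    results = []
    # 'stream' followed by CRLF or LF, excluding the tail of 'endstream'.
    for m in re.finditer(rb"(?<!end)stream(?:\r\n|\n)", pdf_bytes):
        data_start = m.end()
        data_end = pdf_bytes.find(b"endstream", data_start)
        if data_end < 0:
            continue
        # Use the nearest /Length key that precedes the 'stream' keyword.
        length_pos = pdf_bytes.rfind(b"/Length", 0, m.start())
        if length_pos < 0:
            continue
        lm = re.match(rb"/Length\s+(\d+)(\s+\d+\s+R)?",
                      pdf_bytes[length_pos:m.start()])
        if lm is None or lm.group(2):
            continue  # missing value or indirect reference: not handled here
        raw = pdf_bytes[data_start:data_end]
        # Drop one end-of-line marker before 'endstream' (assumed not to be data).
        if raw.endswith(b"\r\n"):
            raw = raw[:-2]
        elif raw.endswith((b"\n", b"\r")):
            raw = raw[:-1]
        results.append((int(lm.group(1)), len(raw)))
    return results

sample = (b"6 0 obj << /Length 5 >> stream\nhello\nendstream endobj\n"
          b"7 0 obj << /Length 13 >> stream\nhello\nendstream endobj\n")
for declared, actual in check_stream_lengths(sample):
    print(declared, actual, "OK" if declared == actual else "MISMATCH")
```

If the data itself ends in a CR or LF byte and the writer adds no separate line ending before `endstream`, this count comes out one or two bytes low, so small differences need a manual look.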

carygravel commented 3 years ago

Perhaps our mistake is expecting libtiff's G4 implementation to match PDF's.

carygravel commented 3 years ago

Could it be that /IxCBA /usesGT 1 shouldn't be in the PDF? That doesn't look right to me.

PhilterPaper commented 3 years ago

/usesGT 1 is just junk left over from a flag $self->{'usesGT'} = 1; used to signal which library to use. It shouldn't cause any problems (should be ignored by a Reader), but if it does, it would have to be suppressed in some way when outputting to the PDF file. /IxCBA is just an object name being used, and has nothing to do with /usesGT. Anyway, if it's really concerning, you can try editing the PDF to replace /usesGT 1 with 9 spaces and see if it makes any difference.
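The reason for using exactly 9 spaces is that `/usesGT 1` is 9 bytes long, so a same-length replacement leaves every byte offset in the cross-reference table unchanged. A minimal sketch of that edit on raw bytes (the function name is made up, and it assumes the byte sequence does not also occur inside stream data):

```python
def blank_out(pdf_bytes, needle=b"/usesGT 1"):
    """Overwrite every occurrence of needle with the same number of spaces."""
    padding = b" " * len(needle)
    return pdf_bytes.replace(needle, padding)

sample = b"<< /Name /IxCBA /Width 2429 /usesGT 1 >> stream\n"
patched = blank_out(sample)
print(len(sample) == len(patched))  # file size, and therefore offsets, unchanged
print(patched)
```

For a real file, read and write in binary mode so that no line-ending translation changes the byte count.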

> Perhaps our mistake is expecting libtiff's G4 implementation to match PDF's.

So what does that mean? Are you saying that libtiff might output G4-compressed image raster data that a PDF reader (at least, Adobe's) can't handle? That would be very bad. Does libtiff actually modify the raster data in any way, or just throw it (unmodified) over the transom? Why would other G4-compressed TIFFs work OK?

carygravel commented 3 years ago

> So what does that mean? Are you saying that libtiff might output G4-compressed image raster data that a PDF reader (at least, Adobe's) can't handle? That would be very bad. Does libtiff actually modify the raster data in any way, or just throw it (unmodified) over the transom? Why would other G4-compressed TIFFs work OK?

I mean that Zip and PNG use implementations of Flate, too, but it doesn't mean they are identical in all corner cases.
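As an illustration of that general point only (it says nothing about G4 itself): ZIP stores raw DEFLATE data, while PNG and PDF's FlateDecode expect the zlib wrapper around it. Both are "Flate", yet a reader expecting one framing rejects the other. A small sketch with Python's standard `zlib` module and a made-up payload:

```python
import zlib

payload = b"bilevel raster data " * 50

# zlib-wrapped stream (the framing PNG and PDF FlateDecode use).
zlib_wrapped = zlib.compress(payload)

# Raw DEFLATE stream without header or checksum (the framing ZIP uses).
raw = zlib.compressobj(wbits=-15)
raw_deflate = raw.compress(payload) + raw.flush()

# Each stream decodes fine when the reader expects the matching framing.
assert zlib.decompress(zlib_wrapped) == payload
assert zlib.decompress(raw_deflate, wbits=-15) == payload

# A reader expecting the zlib wrapper rejects the raw stream.
try:
    zlib.decompress(raw_deflate)
    wrapper_reader_accepts_raw = True
except zlib.error:
    wrapper_reader_accepts_raw = False

print(wrapper_reader_accepts_raw)
```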

libtiff doesn't modify the raster data, as far as I know.

PhilterPaper commented 3 years ago

Attacking it from another angle, are there any TIFF validation tools to see if G4.tiff is fully valid? Maybe it has some extra (or short) or slightly malformed raster data that some PDF Readers (like Adobe) object to?

PhilterPaper commented 3 years ago

I just pushed updated TIFF/File_GT.pm and TIFF_GT.pm to GitHub. It should properly display G3 and G4 bilevel faxes (without the inverted black/white). Uncompressed bilevel (e.g., alpha.tif, uncompressed, with alpha layer) is now flipped to black-on-white. Everything displays fine on XpdfReader and Firefox, and all but G4.tiff on Adobe Reader. outGT.pdf

At least this should fix the problem with (most) inverted colors, and I can worry about why G4 doesn't open on Adobe Reader, and whether flipping black/white on uncompressed bilevel (such as alpha.tiff) is good in all cases (or is a one-off).

carygravel commented 3 years ago

> Attacking it from another angle, are there any TIFF validation tools to see if G4.tiff is fully valid? Maybe it has some extra (or short) or slightly malformed raster data that some PDF Readers (like Adobe) object to?

libtiff was started as an Adobe project and, as far as I know, is also the only implementation. All the software I know of that uses TIFF uses libtiff behind the scenes.

carygravel commented 3 years ago

> I just pushed updated TIFF/File_GT.pm and TIFF_GT.pm to GitHub. It should properly display G3 and G4 bilevel faxes (without the inverted black/white). Uncompressed bilevel (e.g., alpha.tif, uncompressed, with alpha layer) is now flipped to black-on-white. Everything displays fine on XpdfReader and Firefox, and all but G4.tiff on Adobe Reader.

I confirm that the G4 and LZW examples work for me.

The next question is why we are converting LZW to Flate when PDF supports LZW?

PhilterPaper commented 3 years ago

> The next question is why we are converting LZW to Flate when PDF supports LZW?

I have no idea. I'll have to look and see what exactly it's doing. It's not my code (either inherited from earlier non-GT TIFF code, or came from suggested GT code).

Add: I see there is PR #148 open on this. Further discussion there. This ticket should be resolved as far as the color inversion goes, but the matter of G4 generating a PDF that fails in Adobe Reader (AR) is still open (I may open a new ticket for that).

PhilterPaper commented 3 years ago

I've opened #149 to deal with the AR problem with G4.tiff. Since it looks like (fingers crossed) this color inversion is now fixed, I'll close this one.

PhilterPaper commented 1 year ago

Cary, just a quick note per our discussion of early 2021: I have created an xt/ test directory. It should hold "extended" and "author-only" tests that really don't belong in the regular installation testing. t/ tests should be reserved for verifying that required libraries are installed, that the complete PDF::Builder was installed, and that basic functionality works. Deeper testing of added functionality should probably go in the xt/ directory, which doesn't get run at install.