PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

These G4 and LZW-encoded TIFFs are negated when embedded in PDF #141

Closed carygravel closed 3 years ago

carygravel commented 3 years ago

tiff.zip

PhilterPaper commented 3 years ago

G4.tiff and LZW.tiff both appear as black text on a white background in ImageMagick, which I presume is the desired display.

Using -nouseGT (old pure-Perl TIFF code), G4 is rejected because its CCITT 6 compression isn't supported. It also has an undefined $tif->{'whiteIsZero'}, which is easily fixed. LZW displays correctly (black on white).

Using the default Graphics::TIFF library, G4 fails to display (insufficient data for image). That will have to be looked into. Did it work for you? Since it displayed fine in ImageMagick, I would say that the file is probably OK. LZW displays white text on black, which I assume is what you mean by "negated". (In English, I think the conventional term is "inverted colors" or "swapped colors", although "negative colors", as in a photographic negative, would be understandable. Be careful not to imply that the "up" direction has changed.)

Looking in the code, I see comments that the black/white flag (i.e., BlackIsZero) had to be flipped around in the GT file from what you would expect it to be from its name. As I recall, this was done after the sample bilevel images were displayed inverted. It's possible that no TIFF bilevel file I ever tested was missing 'whiteIsZero' because I never had a PhotometricInterpretation = 0 (whiteIsZero) to test?

As for the failure to decode G4 with the Graphics::TIFF library, perhaps this is the first time that this combination of Fax attributes has been encountered by the GT code? So I should concentrate first on the bilevel black/white flag problem, and see if that fixes the insufficient data problem in the process.

If you have a stock of a number of bilevel files, can you try them out?

PhilterPaper commented 3 years ago

Some further research. Examining the .tiff files in an editor, and looking at their directories, I see that both are PhotometricInterpretation 1, which is BlackIsZero (and white is 1). Flipping the black/white around, as done in the code, would account for the inverted colors on LZW (and probably G4 too, if it were seen) on Graphics::TIFF, but doesn't explain why LZW works with non-GT. So why was it necessary to flip the colors around in the first place? As I recall, none of the bilevel TIFFs (I don't know offhand which were PI=0 and which were PI=1) worked unless black/white were swapped. I'm going to have to keep going over the code with a fine-toothed comb until I figure it out.

Looking directly in the files, G4 is compression method 4 (CCITT T6) and LZW is method 5 (LZW). Both the old pure-Perl code and the new GT code appear to just pass on these settings in the 'filter' and 'ccitt' fields for the PDF Reader to decompress. Interestingly, neither old nor new TIFF code appears to handle compression method 2! And the spec says that methods 1 (no compression), 2 (CCITT 1D RLE), and 32773 (PackBits) must be handled at a minimum! If you know that you have some TIFFs with compression method 2, you might want to try them out. By the way, the TIFF specification warns that compression method 2 with PI=1 (BlackIsZero) should expect the resulting image to be inverted. But, as far as I can see, we're not using method 2.
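For readers who want to check these two tags without a hex editor, here is a minimal sketch in Python (not part of PDF::Builder) that reads the first image directory of a TIFF and pulls out Compression (tag 259) and PhotometricInterpretation (tag 262), as numbered in the TIFF 6.0 baseline. The function names and the tiny in-memory test file are illustrative assumptions. Only single SHORT or LONG values are decoded.

```python
import struct

# Tag numbers from the TIFF 6.0 baseline specification.
TAG_COMPRESSION = 259   # 1 = none, 2 = CCITT 1D, 4 = CCITT T.6 (G4), 5 = LZW
TAG_PHOTOMETRIC = 262   # 0 = WhiteIsZero, 1 = BlackIsZero


def read_first_ifd_values(data: bytes) -> dict:
    """Return {tag: value} for single SHORT/LONG entries in the first IFD."""
    if data[:2] == b"II":
        endian = "<"        # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"        # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i          # each IFD entry is 12 bytes
        tag, ftype, n = struct.unpack(endian + "HHI", data[start:start + 8])
        if n != 1:
            continue                             # arrays are skipped in this sketch
        if ftype == 3:                           # SHORT, left-justified in value field
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
        elif ftype == 4:                         # LONG
            (value,) = struct.unpack(endian + "I", data[start + 8:start + 12])
        else:
            continue
        tags[tag] = value
    return tags


def make_minimal_tiff(compression: int, photometric: int) -> bytes:
    """Build a header plus a two-entry IFD; hypothetical helper for the demo only."""
    entries = [(TAG_COMPRESSION, compression), (TAG_PHOTOMETRIC, photometric)]
    header = b"II" + struct.pack("<HI", 42, 8)   # first IFD starts at offset 8
    ifd = struct.pack("<H", len(entries))
    for tag, value in entries:
        ifd += struct.pack("<HHIHH", tag, 3, 1, value, 0)
    ifd += struct.pack("<I", 0)                  # no further IFDs
    return header + ifd


tags = read_first_ifd_values(make_minimal_tiff(4, 1))
print(tags[TAG_COMPRESSION], tags[TAG_PHOTOMETRIC])
```

For a real file, pass `open(path, "rb").read()` instead of the synthetic bytes. The values described above for G4.tiff would be 4 and 1.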

I don't have a whole lot of TIFF files on hand, particularly bilevel (PI=0 or 1), but as far as I know, both the colors and the compression code have handled everything so far. Do you know if there is anything odd about the two .tiff files you sent? Did you use a new scanner or conversion software? There must be some combination of parameters and settings that ImageMagick and the old pure-Perl code can handle, but either Graphics::TIFF itself or the new _GT image code is dropping the ball.

carygravel commented 3 years ago

Firstly - apologies for the very terse issue. I didn't have much time, and wanted to report the issue before I forgot.

Secondly - I only provided the results - here is the original. I was testing how much worse the LZW and CCITT compression was compared to JBIG2. jbig2.pdf

I extracted the image from jbig2.pdf using pdfimages, recompressed it with libtiff, once with lzw, once with g4 (CCITT), and used PDF::Builder to create the PDF. Both had inverted colours.

> G4.tiff and LZW.tiff both appear as black text on a white background in ImageMagick, which I presume is the desired display.

Yup.

> Using -nouseGT (old pure-Perl TIFF code), G4 is rejected because its CCITT 6 compression isn't supported. It also has an undefined $tif->{'whiteIsZero'}, which is easily fixed. LZW displays correctly (black on white).
>
> Using the default Graphics::TIFF library, G4 fails to display (insufficient data for image). That will have to be looked into. Did it work for you? Since it displayed fine in ImageMagick, I would say that the file is probably OK. LZW displays white text on black, which I assume is what you mean by "negated". (In English, I think the conventional term is "inverted colors" or "swapped colors", although "negative colors", as in a photographic negative, would be understandable. Be careful not to imply that the "up" direction has changed.)

G4 worked for me with Graphics::TIFF, but the result also had inverted colours.

> Looking in the code, I see comments that the black/white flag (i.e., BlackIsZero) had to be flipped around in the GT file from what you would expect it to be from its name. As I recall, this was done after the sample bilevel images were displayed inverted. It's possible that no TIFF bilevel file I ever tested was missing 'whiteIsZero' because I never had a PhotometricInterpretation = 0 (whiteIsZero) to test?

My recollections are the same as yours.

> As for the failure to decode G4 with the Graphics::TIFF library, perhaps this is the first time that this combination of Fax attributes has been encountered by the GT code? So I should concentrate first on the bilevel black/white flag problem, and see if that fixes the insufficient data problem in the process.

As I said, I was able to create a PDF from G4.tif. I'll post it later.

> If you have a stock of a number of bilevel files, can you try them out?

Yes - but we can create our own using imagemagick and libtiff - just as I did here.

carygravel commented 3 years ago

g4.pdf g4_with_GT.pdf

Now that is interesting - without GT, the colours are inverted; with it, everything is OK.

PhilterPaper commented 3 years ago

OK, I'm thoroughly confused now. Your 'g4.pdf' matches what I get when I run with the flag -nouseGT=>1 (use the old pure Perl code), and your 'g4_with_GT.pdf' matches what I get with -nouseGT=>0 (the default, run using Graphics::TIFF library). Is that what you ran? Remember, the flag is to NOT use the GT library! Perhaps that's confusing, but it's too late to change the flag name to -useGT and have the default 1.

I still get a failure of "insufficient data for image" from the PDF reader for G4 using Graphics::TIFF. Again, are you sure you're using -nouseGT the correct way? And 'G4.tiff' still fails with pure Perl (old code), as that compression is not supported.

carygravel commented 3 years ago

> OK, I'm thoroughly confused now. Your 'g4.pdf' matches what I get when I run with the flag -nouseGT=>1 (use the old pure Perl code), and your 'g4_with_GT.pdf' matches what I get with -nouseGT=>0 (the default, run using Graphics::TIFF library). Is that what you ran? Remember, the flag is to NOT use the GT library! Perhaps that's confusing, but it's too late to change the flag name to -useGT and have the default 1.

Yes, except that I didn't play with the flags, I just removed and reinstalled GT.

> I still get a failure of "insufficient data for image" from the PDF reader for G4 using Graphics::TIFF. Again, are you sure you're using -nouseGT the correct way? And 'G4.tiff' still fails with pure Perl (old code), as that compression is not supported.

What versions of libtiff and GT do you have?

PhilterPaper commented 3 years ago

Graphics::TIFF 7. I don't recall if you told me how to check libtiff's version, but we decided it was up to date when we talked about the TIFF alpha support a few months ago. Has it changed since then?

Update: \Strawberry\c\lib\libtiff.a is 155,104 bytes, dated 05/18/2017. \Strawberry\perl\site\lib\auto\Graphics\TIFF\TIFF.xs.dll is 48,640 bytes, dated 10/27/2020

Removing the library (Graphics::TIFF) should have the same result as -nouseGT=>1.

PhilterPaper commented 3 years ago

Well, it's easy enough to back out the "flipped black and white" in 2 places in TIFF_GT.pm, and get LZW.tiff to create a good PDF. What worries me is why the flipping (inverting) was done in the first place! I must have had sample TIFF files that displayed inverted, so I had to add the flips, but I don't seem to have them any more (assuming I ever did!). The only ones I could find (for bilevel) now display correctly (and were incorrect with the "flip" code still enabled). Do you have any thoughts? It would be simple to put the revised code into GitHub if you want to play with it. I'm hoping you have a good supply of bilevel TIFFs (PhotometricInterpretation 0 and 1, and a variety of compressions including various fax formats) to test with.

Tomorrow I'll take a crack at why G4 won't display (insufficient data) with GT, but is OK without GT.

carygravel commented 3 years ago

> What worries me is why the flipping (inverting) was done in the first place!

This is why I always like to create a regression test for every bug before I fix it.

> I must have had sample TIFF files that displayed inverted, so I had to add the flips, but I don't seem to have them any more (assuming I ever did!). The only ones I could find (for bilevel) now display correctly (and were incorrect with the "flip" code still enabled). Do you have any thoughts? It would be simple to put the revised code into GitHub if you want to play with it. I'm hoping you have a good supply of bilevel TIFFs (PhotometricInterpretation 0 and 1, and a variety of compressions including various fax formats) to test with.

I think it is important to automate this kind of stuff. The options are then:

Up to now, I have always gone for the 3rd option where I could.

> Tomorrow I'll take a crack at why G4 won't display (insufficient data) with GT, but is OK without GT.

You said last time you had libtiff 4.0.7. I'm using 4.0.10 on this machine. 4.2.0 is current.

PhilterPaper commented 3 years ago

Something very strange is going on here! As of last night, I had the "flip black/white" backed out of the Builder code, and it produced a good LZW.pdf (although G4.pdf still failed). One old bilevel TIFF I found also output correctly. This morning, I forced a rebuild of Graphics::TIFF (cpanm -f) and there was no change to the output (the TIFF.xs.dll was refreshed, but not the libtiff.a). So, per your suggestion that an old version of libtiff might be the problem, I extracted \Strawberry\c\lib\libtiff.a (two years younger and a bit larger) and \Strawberry\c\bin\libtiff-5__.dll (two years younger and quite a bit larger) from the Perl 5.32 image. Rerunning the tests, G4 still fails, and now LZW and the old test case are inverted again!!! It must be using the new libtiff files, but I'll check again after the next reboot.

I normally do keep some test cases around, particularly if they illustrate a bug, but I was unable to locate anything showing that inverting black and white was necessary in the first place. Somehow I misplaced any such files, or maybe didn't bring them over from my old PC when I replaced it three years ago. I'll dig the old machine out of the closet and see if I can get it running and find any more test files.

I see there is a libtiffxx.a and libtiffxx-5__.dll in \Strawberry\c\lib and bin... they're quite a bit smaller. Any idea what they are, and if they might be interfering with libtiff usage?

Could you refresh my memory on how to determine the libtiff version? Thanks.

carygravel commented 3 years ago

Here's how to determine the libtiff version:

https://github.com/PhilterPaper/Perl-PDF-Builder/issues/133#issuecomment-718221701

> I see there is a libtiffxx.a and libtiffxx-5__.dll in \Strawberry\c\lib and bin... they're quite a bit smaller. Any idea what they are, and if they might be interfering with libtiff usage?

libtiffxx.a and libtiffxx-5 will be the C++ interface. I doubt they are a problem for GT or therefore PDF::Builder, as it is all C-based.

PhilterPaper commented 3 years ago

Ah, there it is.

> perl -e "use Graphics::TIFF; print Graphics::TIFF->GetVersion;"
LIBTIFF, Version 4.0.10
Copyright (c) 1988-1996 Sam Leffler
Copyright (c) 1991-1996 Silicon Graphics, Inc.

PhilterPaper commented 3 years ago

OK, the flipped flip has been flipped back... I had changed it back to try something with the new library, and forgot to change it back to the unflipped version afterwards (I need another cuppa joe). Anyway, it's now showing correct "colors" and the G4 issue is the only one still outstanding. I'm trying to get my old laptop recharged to see if there are any other TIFF test files to snag, but it doesn't sound happy while charging (so long as the Li battery doesn't burst into flames... I think I'll bring in a fire extinguisher just to be safe).

carygravel commented 3 years ago

Here's another TIFF with LZW compression that PDF::API2 corrupted in the past.

lzw.zip

PhilterPaper commented 3 years ago

Making some progress here on G4 (with Graphics::TIFF). It is PhotometricInterpretation = 1 (black is 0), which you'd think should end up with a DecodeParms with /BlackIs1 false, but according to g4_with_GT.pdf it should be /BlackIs1 true. Maybe that's what I was basing my black/white "flip" off of before? Different TIFF file, of course. OK, edit my PDF to do that. The stream data (which is CCITTFaxDecode compressed in both PDFs) is identical until the very end. Yours ends 8 bytes sooner than mine, with the last 4 bytes different: FF001001 which I think may be an End-of-Facsimile-Block. Anyway, change the last 12 bytes to your last 4 (and reduce the /Length by 8)... and it works!!
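The comparison described here (two streams identical until the very end, then diverging) can be reproduced mechanically once both image streams have been extracted from their PDFs. The sketch below is a generic illustration in Python, with made-up byte strings rather than the actual G4 data. It reports how many leading bytes match and what each stream has left after that point.

```python
def common_prefix_length(a: bytes, b: bytes) -> int:
    """Number of leading bytes that are identical in both streams."""
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit


def describe_tail_difference(a: bytes, b: bytes):
    """Return (shared prefix length, remaining bytes of a, remaining bytes of b)."""
    n = common_prefix_length(a, b)
    return n, a[n:], b[n:]


# Invented example data: the two streams share a prefix and end differently.
mine = bytes.fromhex("aabbccdd" "0102030405060708090a0b0c")
yours = bytes.fromhex("aabbccdd" "ff001001")
shared, my_tail, your_tail = describe_tail_difference(mine, yours)
print(shared, my_tail.hex(), your_tail.hex())
```

With real data, a 12-byte tail on one side and a 4-byte tail on the other would match the "last 12 bytes versus last 4" observation above.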

Now, I think that it's going through libtiff to get the stream data (in strips), ReadRawStrips, rather than reading it directly, so who is at fault for not handling the end of the image data correctly? Is it supposed to change the end to an EOFB marker? According to the PDF spec, the use of such a marker defaults to /EndOfBlock true, but adding an explicit /EndOfBlock false (within the DecodeParms) didn't seem to help (with the longer data). Am I doing something wrong?

By the way, dumping the raw data returned by libtiff/GT, I see ImageHeight = 3442, while RowsPerStrip = 3440. That seems a bit odd to me, but I'm no expert on TIFF.
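On the ImageHeight/RowsPerStrip observation: the TIFF 6.0 specification gives the number of strips as floor((ImageLength + RowsPerStrip - 1) / RowsPerStrip), so 3442 rows at 3440 rows per strip means the image is stored in two strips, the second holding only two rows. The short Python sketch below works through that arithmetic. Whether this two-strip layout is connected to the "insufficient data" error is not established anywhere in this thread, and the function names are illustrative.

```python
def strips_per_image(image_length: int, rows_per_strip: int) -> int:
    """StripsPerImage as defined in the TIFF 6.0 specification."""
    return (image_length + rows_per_strip - 1) // rows_per_strip


def rows_in_strip(index: int, image_length: int, rows_per_strip: int) -> int:
    """Rows actually stored in strip `index` (the last strip may be short)."""
    first_row = index * rows_per_strip
    return min(rows_per_strip, image_length - first_row)


count = strips_per_image(3442, 3440)
print(count, [rows_in_strip(i, 3442, 3440) for i in range(count)])
```

A reader that copies only one strip's offset and byte count into the PDF stream would therefore miss part of such an image, which is one thing worth ruling out when the strip count is greater than one.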

I'll start playing with lzw.zip tonight and see what it does. Any comments or clarifications on the above would be appreciated.

Add: the lzw.zip (IRS 2011 form) displayed fine in PDF. No problems with it.

carygravel commented 3 years ago

Another tif from an old bug:

https://sourceforge.net/p/gscan2pdf/bugs/_discuss/thread/68484c3a/5365/b860/attachment/page%203.tif

carygravel commented 3 years ago

And the ones that got me started on GT:

https://rt.cpan.org/Ticket/Attachment/1671661/897157/1.tiff

https://rt.cpan.org/Ticket/Attachment/1671661/897158/8.tiff

carygravel commented 3 years ago

> Making some progress here on G4 (with Graphics::TIFF). It is PhotometricInterpretation = 1 (black is 0), which you'd think should end up with a DecodeParms with /BlackIs1 false, but according to g4_with_GT.pdf it should be /BlackIs1 true. Maybe that's what I was basing my black/white "flip" off of before? Different TIFF file, of course. OK, edit my PDF to do that. The stream data (which is CCITTFaxDecode compressed in both PDFs) is identical until the very end. Yours ends 8 bytes sooner than mine, with the last 4 bytes different: FF001001 which I think may be an End-of-Facsimile-Block. Anyway, change the last 12 bytes to your last 4 (and reduce the /Length by 8)... and it works!!

Of course the other difference between your setup and mine is Windows/Linux. This isn't a CR/LF problem, is it, given that I don't get your error message?

PhilterPaper commented 3 years ago

Thank you for the additional TIFF test cases. I will consolidate all my samples into one place.

By the way, I see your two g4*.pdf files were built using PDF::Builder 3.019, rather than the current 3.021. In between were a lot of changes to TIFF handling (with Graphics::TIFF) for colormap problems and alpha channel support. Could you try with current PDF::Builder and see if you now get problems? That might at least narrow down the area.

PhilterPaper commented 3 years ago

I've been going over this thing (that G4 displays OK for you and fails for me) all afternoon and evening, and need to take a break. Have you had a chance to find out whether you're running pure PDF::Builder 3.019, pure 3.021, or a mixture of the two? The PDF you sent says it was produced on 3.019, but there have been a lot of changes to TIFF handling since then.

The G4 TIFF file says it is PhotometricInterpretation 1, where white is 1 and black is 0. That sounds like it should be white content on a black background, but I can't check because I don't have an uncompressed image at any point in the process. The Compression Method is 4, so it shouldn't be inverted (as with 2). I don't understand why BlackIs1 needs to be flipped around (to true) -- the same hangup where I had to flip black and white before without knowing why. Any further exploration on that will have to wait until I get an image without error (insufficient data). As I said before, your image data is a bit shorter and seems to have an EOFB marker, while mine doesn't. I don't do that in the code, so it must be something your libtiff is delivering but mine isn't! I don't want to do an ad-hoc "fix" of replacing my last 12 bytes with your last 4 bytes, unless I really understand when this is applicable.

Something I missed yesterday -- after the /Length and /BlackIs1 changes, and changing the end of the image stream, Reader asks me if I want to save the PDF file when closing it. That means that it corrected an error, so my changes are not enough. :-(

carygravel commented 3 years ago

I test with different machines with different Linux distros. This one has a new PDF::Builder. Here, with GT, I get (inverted):

g4_with_GT.pdf

Without GT, I get the error Chunked CCITT G4 TIFF not supported.

PhilterPaper commented 3 years ago

I get the error "insufficient data for an image" when I try to display it. I'm using Adobe Acrobat Reader DC 20.013.20074.

This displays OK (except inverted black/white) for you? Yet it crashes for me (as does one I produce with PDF::Builder). Here's what I get for G4.tiff (try it on your Reader): outGT.pdf

Here's some debug:

C:\Users\Phil\Desktop>invert.pl 0
file: /Users/Phil/Desktop/PDF-samples/TIFF/bilevel/G4.tiff, 303.625 x 430.25
Compression method 4, better not be 2!
PMI is 1 (whiteIsZero=0)
$tif contents returned from File_GT
$tif->{'ExtraSamples'} = ?
$tif->{'RowsPerStrip'} = '3440'
$tif->{'SamplesPerPixel'} = '1'
$tif->{'bitsPerSample'} = '1'
$tif->{'blackIsZero'} = '1'
$tif->{'ccitt'} = '4'
$tif->{'colorMap'} = 'ARRAY(0x3b87620)'
  colorMap=
$tif->{'colorSpace'} = 'DeviceGray'
$tif->{'fillOrder'} = '1'
$tif->{'filter'} = 'CCITTFaxDecode'
$tif->{'g3Options'} = ?
$tif->{'g4Options'} = ?
$tif->{'imageDescription'} = ?
$tif->{'imageHeight'} = '3442'
$tif->{'imageId'} = ?
$tif->{'imageLength'} = '8'
$tif->{'imageOffset'} = '53727'
$tif->{'imageWidth'} = '2429'
$tif->{'lzwPredictor'} = ?
$tif->{'object'} = 'Graphics::TIFF=SCALAR(0x3b8f928)'
$tif->{'resUnit'} = '3'
$tif->{'whiteIsZero'} = '0'
$tif->{'xRes'} = '118.5'
$tif->{'yRes'} = '118.5'
enter read_tiff()
colorspace(DeviceGray)
in handle_ccitt
whiteIsZero = 0
 DecodeParms.Columns 2429, .Rows 3442, .BlackIs1 0
ending handle_ccitt()
Graphics::TIFF library flag: 1 (1=using GT)

Anything catch your eye? The RowsPerStrip, imageLength, and imageHeight don't quite make sense to me, but I can't quite put my finger on it. Then there's the issue of why it normally comes out black-on-white, when the directory says that black is zero (white-on-black).

carygravel commented 3 years ago

I seem to have misunderstood you. I thought you were saying that your insufficient data for an image error was given by PDF::Builder when creating the PDF. Now I see you mean that the error is produced by Acrobat Reader, which I don't have. I'll see if I can provoke an error from other readers.

PhilterPaper commented 3 years ago

Ah, I thought that error message was well known as coming from AR. Sorry for the confusion. Since it's the gold standard for PDF Readers, it's bad if a file does not render correctly on it. I'd be curious if other Readers worked OK (especially if they do not re-save a fixed-up PDF file). A fix for this error was released last August, but Adobe tells me my copy is up-to-date. I'm going on the assumption that the problem is in PDF::Builder and the changes over the last few months for TIFF colormap and alpha. XpdfReader shows the image, but inverted (white on black). I installed AsTiffTagViewer, and it says that G4.tiff has imageLength of 3442, not 8, so that bears further investigating (libtiff problem?).

I finally managed to pull a few PDFs for testing off my old laptop. The battery only lasts about a half hour, even with the AC adapter plugged in, so I'm afraid that machine isn't long for this world. I'll have to see if I can extract the HDD and maybe put it in a USB-connected external case. With a bad battery, bad AC adapter, and bad display backlight, the rest of it probably is a lost cause. Then Tuesday I'll have as much as 24 inches (60cm) of snow to clear away!

carygravel commented 3 years ago

I'm running out of ideas.

It would surprise me if the problem was with libtiff, as this is the reference implementation for TIFF - not to say there are no bugs, of course there are.

carygravel commented 3 years ago

Documentation for the TIFF tags is here:

https://awaresystems.be/imaging/tiff/tifftags/baseline.html

Note there is no imageHeight tag. This should be imagelength.

ppisar commented 3 years ago

I noticed this issue when performing PDF-Builder's t/tiff.t test:

not ok 10 - G3 (not converted to flate)
#   Failed test 'G3 (not converted to flate)'
#   at t/tiff.t line 154.
#          got: '# ImageMagick pixel enumeration: 1,1,65535,gray
# 0,0: (7.1751,7.1751,7.1751)  #070707  gray(2.81376%)
# '
#     expected: '# ImageMagick pixel enumeration: 1,1,65535,gray
# 0,0: (248.008,248.008,248.008)  #F8F8F8  gray(97.258%)
# '
# Looks like your test exited with 1 just after 10.

The same happens for the LZW test.

Inspecting the images used in the test shows that test.tiff, after conversion by tiffcp, has black text on a white background as expected. But when the images are extracted from the resulting test.pdf with the pdfimages tool, the image has inverted colors.
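The failing comparison above boils down to one number: the gray level of the image after it has been scaled to a single pixel. As a rough illustration of how such a check separates a normal page from an inverted one, here is a small Python sketch that parses the `gray(...%)` value out of an ImageMagick pixel enumeration line. The function names and the 50% threshold are assumptions for the example, not something t/tiff.t does.

```python
import re


def gray_percent(enumeration: str) -> float:
    """Extract the gray(...%) value from ImageMagick 'txt:' pixel output."""
    match = re.search(r"gray\(([0-9.]+)%\)", enumeration)
    if match is None:
        raise ValueError("no gray(...%) value found")
    return float(match.group(1))


def looks_white(enumeration: str, threshold: float = 50.0) -> bool:
    """True if the averaged pixel is lighter than the (assumed) threshold."""
    return gray_percent(enumeration) > threshold


# The two pixel lines from the test output above.
GOT = "0,0: (7.1751,7.1751,7.1751)  #070707  gray(2.81376%)"
EXPECTED = "0,0: (248.008,248.008,248.008)  #F8F8F8  gray(97.258%)"
print(looks_white(GOT), looks_white(EXPECTED))
```

A mostly-white page with black text averages close to 100% gray, so a value near 3% is a strong sign that black and white have been swapped.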

I have libtiff-4.1.0, Graphics-TIFF-7, ImageMagick-6.9.11.27, poppler-0.84.0. I tried PDF-Builder-3.021 as well as latest git PDF-Builder.

PhilterPaper commented 3 years ago

> Note there is no imageHeight tag. This should be imagelength.

Agreed. Both AsTiffTags and libtiff show ImageWidth (imageWidth in libtiff). However, AsTiffTags shows ImageLength as the height (count of rows), while libtiff shows imageHeight as the height, and uses "imageLength" for something else. I should probably dive into the PDF::Builder code and make sure I didn't accidentally carry over "ImageLength" usage into the libtiff version instead of using "imageHeight". I haven't seen anything suggesting such a mixup, but who knows?

PhilterPaper commented 3 years ago

The problems seem to be only with bilevel TIFFs, and only with G3 or G4 compression.

* Firefox has a PDF viewer built in. 

* evince/poppler ditto

* I got a colleague with a Windows machine to try Foxit on it. 

I have displayed the TIFF files with several different viewers, and they all give black text/drawing on a white background. I have tried several PDF viewers, including Adobe Acrobat Reader DC, Xpdf, and Firefox. In all cases LZW-compressed images display black-on-white. In all cases, G3 and G4 compressed images display white-on-black. G4.tiff fails to display at all on Adobe.

> It would surprise me if the problem was with libtiff, as this is the reference implementation for TIFF - not to say there are no bugs, of course there are.

Other than the odd imageLength/imageHeight business, at this point I think we can rule out problems with libtiff.

Regarding G4.tiff on Adobe Reader, even though that Reader is the Industry Standard, I must reluctantly conclude that this is a bug in it. Unless someone has other information, I think we should stop wasting time on trying to fix it. By any chance, do you have an account to report bugs on Adobe products?

Finally, regarding white-on-black swapping, this is probably why I had the ad-hoc swap in before. I could check if it's G3 or G4 compression, and do a swap (/Decode [1 0] or the like) for just those, but I'm still a bit uncomfortable not knowing why it's needed. My CCITT_x.TIF files are PhotometricInterpretation WhiteIsMin (white 0), yet they are flipped. G4.tiff and G31D.TIF are black=0, so really they would seem to be displaying properly (white-on-black), although the TIFF viewer displays black-on-white. Are you comfortable with flipping just G3 and G4 bilevel TIFFs, even though no one can explain why it's needed?
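For reference, this is what a `/Decode [1 0]` entry does to a 1-bit DeviceGray image. The PDF specification maps each sample value x linearly onto the range given by the Decode array: Dmin + x * (Dmax - Dmin) / (2^n - 1), where n is BitsPerComponent. The Python sketch below applies that formula. It shows only the effect of the swap and does not explain why G3/G4 images need it; the function name is illustrative.

```python
def apply_decode(sample: int, bits_per_component: int, decode=(0.0, 1.0)) -> float:
    """Map a raw sample to a color component using a PDF Decode array pair."""
    d_min, d_max = decode
    max_sample = (1 << bits_per_component) - 1
    return d_min + sample * (d_max - d_min) / max_sample


# In DeviceGray, 0.0 is black and 1.0 is white.
default_mapping = [apply_decode(s, 1) for s in (0, 1)]               # /Decode [0 1]
flipped_mapping = [apply_decode(s, 1, (1.0, 0.0)) for s in (0, 1)]   # /Decode [1 0]
print(default_mapping, flipped_mapping)
```

With the default array a 0 bit is painted black and a 1 bit white. With `[1 0]` the two are exchanged, which is the "flip" under discussion.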

PhilterPaper commented 3 years ago

> I noticed this issue when performing PDF-Builder's t/tiff.t test:
>
> not ok 10 - G3 (not converted to flate)
> #   Failed test 'G3 (not converted to flate)'
> #   at t/tiff.t line 154.
> #          got: '# ImageMagick pixel enumeration: 1,1,65535,gray
> # 0,0: (7.1751,7.1751,7.1751)  #070707  gray(2.81376%)
> # '
> #     expected: '# ImageMagick pixel enumeration: 1,1,65535,gray
> # 0,0: (248.008,248.008,248.008)  #F8F8F8  gray(97.258%)
> # '
>
> I have libtiff-4.1.0, Graphics-TIFF-7, ImageMagick-6.9.11.27, poppler-0.84.0. I tried PDF-Builder-3.021 as well as latest git PDF-Builder.

@carygravel could you look at this one? Test 10 is one I have to skip on my non-Linux (Windows) box. It's possible that this one might be cured by the black/white swap on bilevel, but I'm not sure. Maybe set it aside until the black/white issue for G3 and G4 is resolved.

PhilterPaper commented 3 years ago

@carygravel could you please keep an eye on the t/tiff.t test(s) that are skipped due to TIFF+alpha handling problems, and let me know when they should be re-enabled? There's at least one that requires "convert", that I can't test. Someone reported a test failure on CPAN against it (test with "caption:Lorem ipsum"), which sounds to me like a "convert" problem (possibly we need to specify a "convert" version?). Thanks.

carygravel commented 3 years ago

#143 should get the imagemagick tests working with Windows, too.

carygravel commented 3 years ago

Test 10 G3 (not converted to flate) is now failing on my machine, too for the same reason. I'm guessing that it is the same black/white issue as this one (#141).

What I don't understand is that it evidently did pass once, or I wouldn't have created it. Whatever. As long as it is fixed now, too.

PhilterPaper commented 3 years ago

I don't know if I can fix G4 so that Adobe accepts it, but I can probably deal with the G3 and G4 black/white issue by swapping (/Decode [1 0] or the like). I've been hoping that someone could come up with a rationale for this, so I wouldn't have to make an ad-hoc "fix" with no reasoning behind it. Should I go ahead and do that?

ppisar commented 3 years ago

I did a few tests and it seems that the color inversion was triggered by commit 80d0f27e342d9e3e590f527e2e94d0ca23bf5c7b. Then I looked at how the tiff2pdf tool from libtiff creates PDF documents and discovered that the BlackIs1 parameter in its PDF output differs from what PDF::Builder writes. Finally I found out that this patch fixes it:

--- a/lib/PDF/Builder/Resource/XObject/Image/TIFF_GT.pm
+++ b/lib/PDF/Builder/Resource/XObject/Image/TIFF_GT.pm
@@ -640,12 +640,6 @@ sub handle_ccitt {
 sub read_tiff {
     my ($self, $pdf, $tif, %opts) = @_;

To validate the patch it is necessary to test an uncompressed image, an LZW-compressed image, and G3-compressed ones. Originally I wanted to correct this nonsense:

# not sure why whiteIsZero needs to be flipped around???
$decode->{'BlackIs1'} = PDFBool($tif->{'whiteIsZero'} == 0? 1: 0);

But it broke LZW-compressed images. Thus I went with removing the blackIsZero/whiteIsZero inversions which helped in all cases.

By the way, during my tests I noticed that converting G3 images with tiff2pdf produces much smaller PDF files (in number of bytes) than PDF-Builder does.

PhilterPaper commented 3 years ago

I did already remove those lines (one of the ad-hoc "flips" that I don't understand why they are needed), but left the one later in read_tiff concerning not CCITTFaxDecode and whiteIsZero to output /Decode [1 0] (another flip). So other than that, my code (not yet pushed to GitHub) should be free of ad-hoc "flips". The result is that all CCITT compression (G3 and G4) came out white-on-black (inverted), both for PMI=0 (white is 0) and PMI=1 (black is 0). I had to add a /Decode [1 0] to fix these with my TIFF test suite. The only thing I can figure at this point is that libtiff is hiding the fact that Compression is 2, which I saw somewhere means that the image must be inverted.

Also, the alpha.tif example you sent me a few months ago, even though it is uncompressed, not G3 or G4, it needs to be inverted (flipped) with /Decode [1 0] to come out black-on-white. I haven't been able to figure out what combination of settings to look for.

My current status is that everything displays fine (black-on-white), with a flip for all G3 and G4 and alpha.tif by name. I'm not very happy about all CCITTFaxDecode G3/G4 needing to be flipped, without understanding why (Compression 2?). And of course, it won't do to be flipping certain bilevel files by name!

I'm still researching TIFF files, PhotometricInterpretation, Compression, and anything else that may have an impact on this. All suggestions are welcome. I really need to know why a flip (/Decode [1 0]) is needed, in exactly which cases.

PhilterPaper commented 3 years ago

I suppose a new flag (option) -invert => 1 could be added, but really, I should be able to figure that out from just the tags in the TIFF file. Should all CCITT-compressed (G3, G4, ?) bilevel get flipped, and if so, why? Assuming libtiff itself isn't messing with my mind and returning inconsistent data, what am I missing?

Then there's still the issue of why G4.tiff won't produce a good PDF. I wonder if it has something to do with Adobe following the standard for "EndOfBlock" marker use defaulting to "true", while other readers figure their way around it from the data.

I just saw that there is a new Graphics::TIFF released. It doesn't sound like anything was fixed for this issue, but I'll install it and see if it has any effect.

carygravel commented 3 years ago

The new version of GT was just to deal with some small differences in libtiff 4.2.0. It won't affect this problem.

PhilterPaper commented 3 years ago

I have ten bilevel TIFF files to test with. The four LZW-compressed images (LZW, d4D..., 1, page_3) display fine (black artwork on a white background). The five G3- and G4-compressed images (G4, CCITT_1, CCITT_2, CCITT_3, G31D) all require that black and white be swapped (flipped). G4 is PMI 1 (black is 0), while the others are PMI 0 (white is 0). I'm ready to believe that G3 and G4 are actually TIFF Compression = 2 (CCITT), which under some circumstances can come out flipped (white on black). Note that there is no code for handling Compression = 2, suggesting that it's hidden in something else. The documentation mentions that Compression = 2 with PMI = 1 can be expected to flip the colors, but it also seems to do this for PMI = 0. Let's assume that libtiff and Graphics::TIFF are doing something a little unexpected behind the scenes, resulting in a need to flip the results. My current code does flip black and white for G3 and G4 -- I'm still not comfortable with that justification, but unless someone can show a new G3 or G4 TIFF that doesn't require flipping, I'm prepared to go forward with this in the next release. I would probably add a new option -invert => 1 to force a flip the other way (i.e., G3 and G4 would not be flipped), just to cover all our bases.

That takes care of 5 of the troublemakers. The sixth is alpha.tif, which has a transparency (alpha) channel, is uncompressed, and is PMI = 1, yet also needs to be flipped. Unlike G3 and G4, I don't have even a flimsy justification for this one. Could the alpha channel be doing something odd? I'm really reluctant to go arbitrarily flipping uncompressed + PMI = 1 + alpha channel (or some subset of those) unless I have a better idea of what's going on. -invert => 1 could be used to force flipping, but that's not a very good solution.

Finally, we have the problem with the G4.tif file (in addition to needing to be flipped) not displaying (error) in Adobe Acrobat Reader DC (and possibly other Adobe products). On the other hand, XpdfReader and Firefox (browser) both display G4 just fine (still needing to be flipped). My best guess would be that Adobe is stricter about something in the PDF than the others are. Since Adobe is generally considered the "gold standard", it would be good to find out what's wrong with the PDF and fix it.

Add: Your successful g4_with_GT.pdf, while created with Graphics::TIFF, was built on version 3.019, before all the changes for alpha support in TIFF. So, somewhere in 3.020 or 3.021 there must be a minor difference that Adobe chokes on but other PDF readers can get around.

carygravel commented 3 years ago

Add: Your successful g4_with_GT.pdf, while created with Graphics::TIFF, was built on version 3.019, before all the changes for alpha support in TIFF. So, somewhere in 3.020 or 3.021 there must be a minor difference that Adobe chokes on but other PDF readers can get around.

I have posted two versions of g4_with_GT.pdf, one built with v3.019 and one with 3.021. Are you saying that the older version produced a G4-encoded PDF that AR reads without error, and the newer version produces the error?

Are you also saying that you cannot replicate my old and new PDFs with Windows?

We can use git bisect to find the commit that introduced the bug.

PhilterPaper commented 3 years ago

I have posted two versions of g4_with_GT.pdf, one built with v3.019 and one with 3.021. Are you saying that the older version produced a G4-encoded PDF that AR reads without error, and the newer version produces the error?

Yes, the 3.019 version appears to be good on all three readers (Adobe Acrobat Reader DC, Firefox browser, XpdfReader), while 3.021 blows up on Adobe (insufficient image data) and is inverted on the other two.

Are you also saying that you cannot replicate my old and new PDFs with Windows?

I think I went back with 3.019 TIFF code and was able to create a good PDF, but 3.021 produces the unreadable (on Adobe) PDF. So, I would say I can replicate your PDFs.

We can use git bisect to find the commit that introduced the bug.

I already compared the 3.019 and 3.021 TIFF files (TIFF_GT and File_GT) line by line, and found nothing that looks like it should produce a problem. The next step will have to be finding which commit did the evil deed. I'm more worried about the Adobe Reader error. The inverted black/white is less of a problem, even though at this point I'm uncomfortable that it's still something of an ad-hoc fix rather than deeply understanding the issue.

Lately I've been concentrating on a major problem with PDF::Table, and haven't been able to devote much time to PDF::Builder. That and clearing repeated snowfalls from my driveway! If you want to go ahead and bisect to find the commit, that would be appreciated.

carygravel commented 3 years ago

As I don't have Acrobat Reader available, I can't check the result there. If you are saying the only difference is the length of the stream, I can git bisect to find that.

PhilterPaper commented 3 years ago

As I mentioned some time ago, there is a difference in the end of the image stream. Your working one is actually shorter than the one which gets the insufficient data error, and has some differences. I'm wondering if the working one has an EOFB marker and the other doesn't, but I don't think there was ever code to insert a marker. Maybe it's something that libtiff does when it returns the raster data, and it's changed with different libtiff versions? The different end wasn't the only change, as I recall, since AR still made some sort of repair that required saving the PDF after viewing.

carygravel commented 3 years ago

I see that (at least in github) there are no tags. It would be really useful, especially in cases like this, to tag important releases. Would you mind doing so, or if you have already done it, push the tags to github?

carygravel commented 3 years ago

This code:

#!/usr/bin/perl
use warnings;
use strict;
use PDF::Builder;

my $pdf = PDF::Builder->new(-file => 'g4.pdf');
my $tiff = 'G4.tiff';
my $page = $pdf->page;
$page->mediabox(2429*72, 3442*72);
my $gfx = $page->gfx;
my $img = $pdf->image_tiff($tiff);
$gfx->image($img, 0, 0, 2429*72, 3442*72);
$pdf->info(Producer=>'PDF::Builder');
$pdf->save;
$pdf->end;

plus git bisect showed me (as you said before - I don't know what I was thinking in this comment) that 3.019 produces a good PDF (at least evince shows it correctly), and that this commit introduced the problem, which also increased the file size by one byte:

commit 80d0f27e342d9e3e590f527e2e94d0ca23bf5c7b
Author: Phil Perry <phil4597@catskilltech.com>
Date:   Sun Dec 20 16:51:51 2020 -0500

    working on TIFF+alpha display

 Changes                                           |   6 +-
 lib/PDF/Builder/Basic/PDF/File.pm                 |   2 +-
 lib/PDF/Builder/Resource/XObject/Image/TIFF_GT.pm | 195 ++++++++++++++++++++--
 3 files changed, 184 insertions(+), 19 deletions(-)

I'll go through that commit next and see if I can isolate what caused the problem.

carygravel commented 3 years ago

Backing out this part of the commit was enough to fix both problems (which is also what Petr said in this comment):

+    # not sure why blackIsZero needs to be flipped around???
+    if (defined $tif->{'blackIsZero'}) {
+        $tif->{'blackIsZero'} = $tif->{'blackIsZero'} == 1? 0: 1;
+        $tif->{'whiteIsZero'} = $tif->{'blackIsZero'} == 1? 0: 1;
+    }

It is not clear to me why this changes the file size.

PhilterPaper commented 3 years ago

/BlackIs1 false replaced /BlackIs1 true ?

The color-flip problem is minor compared to why Adobe Reader barfs on the G4 output, although I'd really like to know why G3 and G4 compression, and uncompressed (in at least one case) need to flip the colors.

carygravel commented 3 years ago

/BlackIs1 false replaced /BlackIs1 true ?

Aha. Yup. That makes sense.
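A quick check of the one-byte difference, assuming the two dictionary entries are otherwise written identically:

```perl
use strict;
use warnings;

# The only textual change implied by flipping the flag.
my $old = '/BlackIs1 true';
my $new = '/BlackIs1 false';

printf "size difference: %d byte(s)\n", length($new) - length($old);
```

"false" is one character longer than "true", which matches the file growing by exactly one byte.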

That implies that you expect Adobe Reader to complain about g4.pdf but not g4-3.019.pdf. With evince, they look identical to me. Internally, they differ, but I haven't checked how.

If you can confirm that, I'll investigate further.

PhilterPaper commented 3 years ago

This is getting stranger and stranger. I replaced TIFF/File_GT.pm and TIFF_GT.pm with their 3.019 versions (and even their 3.018 versions), and G4.tiff still gives a PDF that Adobe can't open (but XpdfReader and Firefox can). I've looked through all the changed files since 3.019 and can't see anything in any other file that could possibly have any effect on image display.

Maybe I will have to archive my current (3.021+) code and force an install of 3.019 (or even 3.018) to confirm that G4.tiff was working before all the changes to TIFF for alpha channel. Even then, since I've upgraded Graphics::TIFF from 6 to 9 and libtiff from 4.0.7 to 4.0.10, maybe something in the returned image has changed. The 3.019 version of g4_with_GT.pdf you sent works for me, but I don't know what other changes you've made. Maybe gscan2pdf made some subtle change that fixed whatever problem I'm having?

Speaking of which, as I detailed earlier, the current PDF raster image data is unchanged except for the last couple dozen bytes. I don't see any code in PDF::Builder that should be changing the returned raster data, so I must conclude that libtiff and/or Graphics::TIFF is doing something a bit different. I presume you're familiar with the changes to both -- could there be anything changed there? Or, gscan2pdf did something.

I already have code in (but not pushed to GitHub yet) to properly display (flip to proper usage) black and white for CCITT G3 and G4. As I said before, your alpha.tiff (uncompressed) also needs the same handling, but I don't know if that applies to all uncompressed TIFFs. I hate to put in black/white flipping for all uncompressed TIFFs on just an ad-hoc basis (and on a sample size of one!).

Have you looked for an Adobe Acrobat Reader for Linux? Supposedly, Adobe discontinued development some time ago, but the old Readers are still usable on current Linux platforms. At least, we could confirm that you're seeing the same display error that I'm seeing (maybe).

carygravel commented 3 years ago

The only code I am using for these tests ATM is in this comment, which I have just updated to prevent Builder from putting its version number in the Producer metadata to be able to compare bit-for-bit.

Having done so, and with this patch backed out, Builder now produces a PDF identical to the one from 3.019.

Just for comparison g4-3.019.pdf is from 3.019, g4-3.021.pdf is from the current HEAD in github, and g4-patched.pdf is the same, but with the above patch backed out.
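For anyone who wants to repeat the bit-for-bit comparison, here is a minimal sketch that reports the first byte offset at which two files differ. The helper names first_difference and slurp are made up for this example; the file names are whatever you pass on the command line.

```perl
use strict;
use warnings;

# Read a whole file as raw bytes.
sub slurp {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Return the offset of the first differing byte between two files,
# or -1 if they are identical. If one file is a prefix of the other,
# the offset returned is the length of the shorter file.
sub first_difference {
    my ($file_a, $file_b) = @_;
    my $data_a = slurp($file_a);
    my $data_b = slurp($file_b);
    return -1 if $data_a eq $data_b;
    my $limit = length($data_a) < length($data_b) ? length($data_a) : length($data_b);
    for my $i (0 .. $limit - 1) {
        return $i if substr($data_a, $i, 1) ne substr($data_b, $i, 1);
    }
    return $limit;
}

if (@ARGV == 2) {
    my $offset = first_difference(@ARGV);
    print $offset < 0 ? "identical\n" : "first difference at byte $offset\n";
}
```

Run it with two PDF paths as arguments, for example g4-3.019.pdf and g4-patched.pdf.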

Adobe stopped supporting Acrobat Reader for Linux over 10 years ago. There are installation instructions here, but it has old dependencies that recent Linux distros don't have, so I can't install it.

This online validator reports for the 3.021 version:

Validating file "g4-3.021.pdf" for conformance level pdf1.4
The image's sample stream's computed length 1046368 is different to the actual length 1045760.
The document does not conform to the requested standard.
The file format (header, trailer, objects, xref, streams) is corrupted.
The document does not conform to the PDF 1.4 standard.

But it reports the same errors for 3.019:

Validating file "g4-3.019.pdf" for conformance level pdf1.4
The image's sample stream's computed length 1046368 is different to the actual length 1045760.
The document does not conform to the requested standard.
The file format (header, trailer, objects, xref, streams) is corrupted.
The document does not conform to the PDF 1.4 standard.
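The two lengths in that report can be reproduced with a little arithmetic. This is a sketch under assumptions: that the 2429 and 3442 in the test script are the image's pixel width and height, that the image is 1 bit per pixel, and that the validator's "computed length" is rows times padded bytes per row. None of that is confirmed by the validator itself.

```perl
use strict;
use warnings;

# Assumed from the test script: 2429 x 3442 pixels, 1 bit per pixel.
my ($width, $height, $bits) = (2429, 3442, 1);

# Each row is padded up to a whole number of bytes.
my $row_bytes = int(($width * $bits + 7) / 8);
my $expected  = $row_bytes * $height;
my $reported  = 1045760;    # "actual length" from the validator output

printf "bytes per row: %d\n", $row_bytes;
printf "expected:      %d\n", $expected;
printf "missing rows:  %s\n", ($expected - $reported) / $row_bytes;
```

Under those assumptions the expected length comes out to 1046368, matching the validator's computed value, and the shortfall of 608 bytes is exactly two rows of 304 bytes. That would be consistent with the decoded image data ending two rows early in both the 3.019 and 3.021 files.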