dragon66 / icafe

Java library for reading, writing, converting and manipulating images and metadata
Eclipse Public License 1.0
204 stars 58 forks

TIFFTweaker duplicates image data for type 6 TIFFs #49

Closed (gcmvy3 closed this issue 5 years ago)

gcmvy3 commented 7 years ago

Context: I wrote a program using icafe that splits one large multipage TIFF into several smaller multipage TIFFs. I wanted to do this without decompressing the TIFFs (they mostly use type 6 JPEG compression), so I used the copyPages function from the TIFFTweaker class.

The problem: The program worked, but the files it created were exactly twice the size they were supposed to be. After much debugging I discovered that for each page in the TIFF, there were two sets of identical image data in the file. However, only one set of image data was actually referenced by an IFD, so the extra set did not show up in any TIFF viewer or analyzer.

The solution: I tracked down the problem to the copyPageData function. I realized that if the TIFF file uses type 6 JPEG compression, the image data is copied twice. I solved this by adding two conditionals:

  1. I added an if statement near line 460 of TIFFTweaker. I check if the image uses type 6 compression, and if it does I skip this whole block. There is no need to copy the image data here because if the compression is type 6, it will be copied later in the function.
  2. I added an if statement near line 480 of TIFFTweaker. I check if the image uses type 6 compression, and if it DOESN'T, I skip this block. The idea is to prevent non-type 6 files from copying extra data. I'm not sure if this is necessary or not.
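The two conditionals described above can be sketched roughly as follows. This is a self-contained illustration, not the actual TIFFTweaker code: the class `PageCopySketch`, its method signature, and the use of plain arrays for offsets and byte counts are all assumptions made for the example. Only the value 6 for old-style JPEG compression comes from the TIFF Compression tag.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-in for the copy logic described above; not icafe's real code.
final class PageCopySketch {
    // Value of the TIFF Compression tag for old-style JPEG.
    static final int COMPRESSION_OLD_JPEG = 6;

    /**
     * Copies one page's image data from src to out and returns the number of bytes written.
     * stripOffsets/stripByteCounts describe the strips (or tiles); jpegOffset/jpegLength
     * describe the JPEGInterchangeFormat stream, with jpegLength == 0 meaning "tag absent".
     */
    static int copyPageData(byte[] src, int compression,
                            int[] stripOffsets, int[] stripByteCounts,
                            int jpegOffset, int jpegLength,
                            ByteArrayOutputStream out) {
        int before = out.size();
        boolean oldJpeg = (compression == COMPRESSION_OLD_JPEG);

        // Conditional 1: skip the strip/tile copy for type 6 pages.
        if (!oldJpeg) {
            for (int i = 0; i < stripOffsets.length; i++) {
                out.write(src, stripOffsets[i], stripByteCounts[i]);
            }
        }

        // Conditional 2: copy the raw JPEG stream only for type 6 pages.
        // Note: a type 6 page without a JPEGInterchangeFormat stream would copy nothing here.
        if (oldJpeg && jpegLength > 0) {
            out.write(src, jpegOffset, jpegLength);
        }
        return out.size() - before;
    }
}
```

With these guards, a type 6 page whose strips and JPEG stream cover the same bytes is written once instead of twice. A real implementation would also have to rewrite the offsets stored in the output IFD, which this sketch leaves out.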

This problem was tricky to debug, so I figured I would post it here for reference. My solution is not the most elegant, but it works. If you have any questions or suggestions just let me know.

dragon66 commented 7 years ago

@BobsCandyCanes You are talking about the so-called old-style JPEG compression (type 6), which is no longer supported by new TIFF writers.

This compression scheme can cause a lot of confusion, and there can be duplicate tags that point to the same data.

This is what happened in your case: the normal TIFF strip/tile tags and the JPEGInterchangeFormat tag both point to the same data. In the TIFFTweaker copyPageData function, both are copied as-is to be on the safe side. That could produce the duplicated data you mentioned.
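One way to detect this situation before copying is to check whether every strip lies inside the byte range given by the JPEGInterchangeFormat and JPEGInterchangeFormatLength tags. The following is only a sketch under that assumption: the class `OverlapSketch` and its method are hypothetical and are not part of icafe, and the offsets are passed in as plain arrays rather than read from an IFD.

```java
// Hypothetical helper, not part of icafe: checks whether strip data and the
// JPEGInterchangeFormat stream refer to the same bytes in the source file.
final class OverlapSketch {
    /**
     * Returns true if every strip range [offset, offset + count) lies inside
     * the JPEG stream range [jpegOffset, jpegOffset + jpegLength).
     * Assumes stripOffsets and stripByteCounts have the same length.
     */
    static boolean stripsInsideJpegStream(long[] stripOffsets, long[] stripByteCounts,
                                          long jpegOffset, long jpegLength) {
        if (jpegLength <= 0 || stripOffsets.length == 0) {
            return false; // no JPEG stream or no strips: nothing can be duplicated
        }
        long jpegEnd = jpegOffset + jpegLength;
        for (int i = 0; i < stripOffsets.length; i++) {
            long start = stripOffsets[i];
            long end = start + stripByteCounts[i];
            if (start < jpegOffset || end > jpegEnd) {
                return false; // this strip has bytes outside the JPEG stream
            }
        }
        return true;
    }
}
```

If the check returns true, copying the JPEG stream once already carries the strip bytes, so the strip/tile tags in the output could point into that single copy instead of a second one.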

In your change, you seem to bypass the normal strip/tile data copying and rely only on the second part, the copying of the raw JPEG stream that the JPEGInterchangeFormat tag points to. This may make the resulting image unrecognizable to some TIFF readers, since all TIFF readers expect to find the strip/tile data, except in embedded EXIF data. By the way, I don't think the second check is necessary: if the compression is not type 6, no JPEGInterchangeFormat tag should be present, so no data will be copied a second time.

The TIFFTweaker implementation is also used to extract thumbnails from JPEG EXIF data. In that case, it uses the JPEGInterchangeFormat tag to locate and extract the raw JPEG thumbnail data from a JPEG image.
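As a rough illustration of that thumbnail case, the sketch below slices a JPEG stream out of a TIFF-structured block given the offset and length from the JPEGInterchangeFormat and JPEGInterchangeFormatLength tags. The class `ThumbnailSketch` and its method are hypothetical and do not reflect how icafe implements this. The sketch assumes the offset is relative to the start of the block passed in.

```java
import java.util.Arrays;

// Hypothetical helper, not icafe's API: extracts a raw JPEG stream by offset and length.
final class ThumbnailSketch {
    /**
     * Returns a copy of the JPEG stream located at [offset, offset + length) in tiffBlock.
     * Rejects ranges that fall outside the block or do not start with the JPEG SOI marker.
     */
    static byte[] extractJpeg(byte[] tiffBlock, int offset, int length) {
        if (offset < 0 || length < 2 || offset > tiffBlock.length - length) {
            throw new IllegalArgumentException("JPEG range is outside the TIFF block");
        }
        // Every JPEG stream starts with the SOI marker 0xFF 0xD8.
        if ((tiffBlock[offset] & 0xFF) != 0xFF || (tiffBlock[offset + 1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("data at offset does not start with a JPEG SOI marker");
        }
        return Arrays.copyOfRange(tiffBlock, offset, offset + length);
    }
}
```

The SOI check is a cheap sanity test that the tag really points at JPEG data before the bytes are handed to a decoder.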

dragon66 commented 5 years ago

@gcmvy3 I checked in a fix for this issue.