Patent2net / P2N


FusionImages: PIL/Pillow can't read TIFF image #24

Closed amotl closed 6 years ago

amotl commented 6 years ago

Hi there,

we found that FusionImages.py fails to create thumbnails. We used the following recipe to reproduce the problem:

Recipe

Setup "patent2net" module

For installing the software, please follow the instructions outlined on https://docs.ip-tools.org/patent2net/setup.html.

Setup PIL successor

pip install pillow==5.0.0

Acquire images

export P2N_CONFIG=`pwd`/RequestsSets/Lentille.cql
p2n images

Attempt to read TIFF image

>>> from PIL import Image
>>> im = Image.open('DATA/Lentille/PatentImages/WO2011015115-1.tiff')
>>> im.tobytes()

Exceptions

When using Pillow

tempfile.tif: Cannot read TIFF header.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/amo/dev/elmyra/sources/P2N/.venv27/lib/python2.7/site-packages/PIL/Image.py", line 720, in tobytes
    self.load()
  File "/Users/amo/dev/elmyra/sources/P2N/.venv27/lib/python2.7/site-packages/PIL/TiffImagePlugin.py", line 1039, in load
    return self._load_libtiff()
  File "/Users/amo/dev/elmyra/sources/P2N/.venv27/lib/python2.7/site-packages/PIL/TiffImagePlugin.py", line 1131, in _load_libtiff
    raise IOError(err)
IOError: -2

When using PIL

Traceback (most recent call last):
  File "Patent2Net\FusionImages.py", line 65, in <module>
  File "Patent2Net\FusionImages.py", line 28, in generate_thumbnails
  File "site-packages\PIL\Image.py", line 2452, in open
IOError: cannot identify image file '..\\DATA\\lentille\\PatentImages\\CN101988689-1.tiff'

Further notices

However, opening the file in question on a Mac OS X machine works fine:

# Show header information
file DATA/Lentille/PatentImages/WO2011015115-1.tiff
DATA/Lentille/PatentImages/WO2011015115-1.tiff: TIFF image data, big-endian, direntries=10, height=0, compression=bi-level group 4, PhotometricIntepretation=WhiteIsZero, orientation=upper-left, width=0

# Open image in "Preview"
open DATA/Lentille/PatentImages/WO2011015115-1.tiff

amotl commented 6 years ago

While working on PatZilla, we remember having problems converting TIFF images from the patent data universe with the regular Python PIL library, due to some less common TIFF features those images use.

After digging for the relevant details, we found a comment from the past in the patzilla.util.image.convert.to_png function:

# Unfortunately, PIL can not handle G4 compression.
# Failure: exceptions.IOError: decoder group4 not available
# Maybe patch: http://mail.python.org/pipermail/image-sig/2003-July/002354.html

and the file header says it actually is a "bi-level group 4"-type image:

file DATA/Lentille/PatentImages/WO2011015115-1.tiff
TIFF image data, big-endian, compression=bi-level group 4, [...]
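The same check can be done without the "file" utility by reading the TIFF header directly. The following is a minimal sketch, assuming a classic (non-BigTIFF) file whose first image file directory carries the Compression tag; the function name tiff_compression is made up for illustration and is not part of P2N or PatZilla.

```python
import struct

COMPRESSION_TAG = 259  # TIFF tag "Compression"
GROUP4 = 4             # CCITT T.6 bi-level encoding ("Group 4")


def tiff_compression(data):
    """Return the Compression tag value of the first IFD, or None if absent."""
    # The first two bytes announce the byte order used by the whole file.
    byte_order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if byte_order is None:
        raise ValueError("not a TIFF file: unknown byte-order mark")
    magic, ifd_offset = struct.unpack(byte_order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file: bad magic number")
    (entry_count,) = struct.unpack_from(byte_order + "H", data, ifd_offset)
    for index in range(entry_count):
        # Each directory entry is 12 bytes: tag, type, count, value/offset.
        entry_offset = ifd_offset + 2 + 12 * index
        tag, field_type, count = struct.unpack_from(
            byte_order + "HHI", data, entry_offset)
        if tag == COMPRESSION_TAG:
            # A single SHORT is stored left-justified in the 4-byte value field.
            (value,) = struct.unpack_from(
                byte_order + "H", data, entry_offset + 8)
            return value
    return None
```

Calling tiff_compression(open(path, "rb").read()) == GROUP4 would then flag the images that need a Group 4 capable decoder.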

To mitigate the issue, we had to resort to the "convert" tool of ImageMagick fame and never looked back. Let's just go ahead and reuse this recipe from PatZilla in Patent2Net, if you don't have any objections.
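For readers unfamiliar with that approach, shelling out to ImageMagick could look roughly like the sketch below. It is not the actual PatZilla code: the function names are hypothetical, and it assumes a "convert" executable on the PATH (ImageMagick 7 prefers the "magick" command instead).

```python
import subprocess


def build_convert_command(source, target, size=None):
    """Build the argument list for ImageMagick's "convert" program."""
    command = ["convert", source]
    if size is not None:
        # "-thumbnail" resizes to fit within WIDTHxHEIGHT, keeping the aspect ratio.
        command += ["-thumbnail", "{}x{}".format(*size)]
    # ImageMagick infers the output format from the target's file extension.
    command.append(target)
    return command


def convert_image(source, target, size=None):
    """Convert an image via ImageMagick; raises CalledProcessError on failure."""
    subprocess.check_call(build_convert_command(source, target, size))
```

For example, convert_image("WO2011015115-1.tiff", "WO2011015115-1.png", size=(128, 128)) would ask ImageMagick for a PNG thumbnail. Passing the arguments as a list avoids shell quoting problems with unusual file names.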

rfaga commented 6 years ago

Hey @amotl, thanks for checking this issue and also for making the PR!

I have a compiled Pillow with full support for TIFF images; at least, I could make it work with the thousands of images I tested with EPO. But I don't think it's easy to have it compiled in different environments, especially on Windows, so I think your proposal makes sense for P2N.

amotl commented 6 years ago

Hey @rfaga,

good to know this actually is possible with Pillow. Would you mind sharing your installation instructions for others to reproduce? Maybe I will also give it a try.

Otherwise, if you also think using ImageMagick for the thumbnailing task is a more approachable solution for newcomers, let's polish the PR #25 and use it as the default implementation?

I would keep the Pillow-based implementation and maybe add a toggle switch (environment flag) for choosing between both strategies explicitly. Alternatively we can use the ImageMagick-based strategy as a fallback to the Pillow-strategy implicitly.
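Both variants, the explicit toggle and the implicit fallback, can be sketched in a few lines. This is only an illustration under stated assumptions: the environment variable P2N_THUMBNAIL_STRATEGY and the function names are hypothetical, and the two implementations are passed in as plain callables taking (source, target).

```python
import os


def choose_strategy(environ=None):
    """Return "pillow", "imagemagick", or "auto" based on an environment flag."""
    environ = os.environ if environ is None else environ
    # P2N_THUMBNAIL_STRATEGY is a hypothetical name, not an existing P2N setting.
    value = environ.get("P2N_THUMBNAIL_STRATEGY", "auto").strip().lower()
    if value not in ("pillow", "imagemagick", "auto"):
        raise ValueError("unknown thumbnail strategy: {!r}".format(value))
    return value


def make_thumbnail(source, target, pillow_impl, imagemagick_impl, environ=None):
    """Dispatch to one implementation; in "auto" mode fall back implicitly."""
    strategy = choose_strategy(environ)
    if strategy == "pillow":
        return pillow_impl(source, target)
    if strategy == "imagemagick":
        return imagemagick_impl(source, target)
    try:
        return pillow_impl(source, target)
    except (IOError, OSError):
        # Pillow could not read or decode the file; try ImageMagick instead.
        return imagemagick_impl(source, target)
```

With the flag unset, the "auto" mode gives the implicit fallback; setting it to "pillow" or "imagemagick" forces one strategy and lets its errors propagate.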

With kind regards, Andreas.

rfaga commented 6 years ago

@amotl I actually think Pillow is distributing a compiled version that works with tiff: https://pypi.python.org/pypi/Pillow/5.0.0

I just ran pip install Pillow, and it didn't even check for my -dev packages to compile against. According to https://pillow.readthedocs.io/en/latest/releasenotes/5.0.0.html#compressed-tiff-images, it's using libtiff, so pip install will probably work in any environment.

But I still think ImageMagick could be a fallback. Maybe we could try to convert with the environment's PIL first and, if we get an error, go with ImageMagick for the following tries. What do you think?
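One way to read "for the following tries" is a fallback that sticks: after the first failure, the primary converter is skipped for all remaining images. A minimal sketch of that idea follows; the class name StickyFallback is invented here, and the converters are assumed to be plain callables.

```python
class StickyFallback(object):
    """Call `primary` until it fails once, then use `fallback` for all later calls."""

    def __init__(self, primary, fallback, errors=(IOError, OSError)):
        self.primary = primary
        self.fallback = fallback
        self.errors = errors
        self.primary_failed = False

    def __call__(self, *args, **kwargs):
        if not self.primary_failed:
            try:
                return self.primary(*args, **kwargs)
            except self.errors:
                # Remember the failure so later images skip the primary converter.
                self.primary_failed = True
        return self.fallback(*args, **kwargs)
```

The trade-off is that a single unreadable file switches the whole run over to the fallback. That avoids paying for a failed attempt on every image, but it also gives up on the primary converter for files it could have handled.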

Regards, Roberto

Patent2net commented 6 years ago

I experimented with ImageMagick 10 years ago. It worked fine in all environments and was (at that time) easy to use and configure. But the choice is yours.

Patent2net commented 6 years ago

Well, I managed to handle those files. The problem was not in reading the TIFF files but in saving them properly: I had to add a "binary" switch. Now it works fine (at least for the Lentille case).
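The thread does not show the actual change. Assuming the "binary" switch refers to the "b" flag in the file mode used when writing the downloaded images, the fix would look something like the sketch below; the function name save_image_bytes is hypothetical. On Windows under Python 2, a file opened with plain "w" translates every newline byte into a carriage return plus newline, which corrupts binary data such as TIFF.

```python
def save_image_bytes(path, payload):
    """Write raw image bytes to disk unchanged by using binary mode."""
    # "wb" disables newline translation; plain "w" would alter the bytes
    # on Windows under Python 2 (assumed to be the cause discussed here).
    with open(path, "wb") as handle:
        handle.write(payload)
```

A file damaged that way would be consistent with the "cannot identify image file" error in the Windows traceback above, although the thread does not confirm this.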

amotl commented 6 years ago

Hi there,

I just happened to install a fresh release of my OS. From now on, I will be using Homebrew instead of MacPorts, and everything under the hood (Xcode, etc.) is also up to date. So I will try installing Pillow again to see whether it now supports compressed TIFF images properly.

> The error wasn't to handle the tiff files but to save them properly : I had to add a "binary" switch.

May I humbly ask which amendments you had to make? I would then add them to the current "develop" branch / the amendments of #25. Thanks!

Cheers, Andreas.

amotl commented 6 years ago

Using current software actually solved my problem: with Pillow on High Sierra, things work perfectly. Thanks for listening.