alephdata / ingest-file

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
GNU Affero General Public License v3.0

Fix TIFF processing. Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp #587

Closed catileptic closed 5 months ago

catileptic commented 6 months ago

This PR addresses the fact that some TIFF images were not being OCR'ed correctly.

This stemmed from the fact that, if the TIFF file contained data with JPEG compression, the tiff2pdf command, as it existed before this commit, would generate a PDF with an empty image.

To address this, we have added the -n and -j flags to the tiff2pdf command.

However, TIFF images that do not contain JPEG-compressed data are not converted correctly to PDF when the -j flag is used: tiff2pdf errors out. This PR therefore runs the command with the -n and -j flags first and, if that fails, retries without them. The reverse order does not work, because tiff2pdf without the flags does not fail on a TIFF with JPEG-compressed data; it silently produces a PDF containing an empty image, so there is no error to trigger the fallback.
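The try-then-fall-back order described above can be sketched roughly as follows. This is an illustration, not the code from this PR: the function names, the injectable `run` parameter, and the use of `-o` for the output path are assumptions made for the example.

```python
import subprocess
from typing import Callable, List


def build_commands(tiff_path: str, pdf_path: str) -> List[List[str]]:
    """Return the tiff2pdf invocations to try, in order (hypothetical helper)."""
    base = ["tiff2pdf", "-o", pdf_path]
    return [
        # First attempt: with -n and -j, needed for JPEG-compressed TIFFs.
        base + ["-n", "-j", tiff_path],
        # Fallback: plain invocation for TIFFs without JPEG-compressed data.
        base + [tiff_path],
    ]


def convert_tiff_to_pdf(
    tiff_path: str,
    pdf_path: str,
    run: Callable = subprocess.run,
) -> List[str]:
    """Try each command in order and return the one that succeeded."""
    last_error = None
    for cmd in build_commands(tiff_path, pdf_path):
        try:
            # check=True turns a non-zero exit status into CalledProcessError.
            run(cmd, check=True, capture_output=True)
            return cmd
        except subprocess.CalledProcessError as exc:
            last_error = exc
    raise RuntimeError(f"tiff2pdf failed for {tiff_path}") from last_error
```

The order matters because only the flagged invocation fails loudly on the wrong input; the plain invocation can exit successfully while producing an empty image, so it cannot serve as the first attempt.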

This PR also adds a test that specifically guards against a regression in the OCR behaviour described above. It also updates the existing TIFF parsing test. TIFF images get converted into one "Pages" entity with several "Page" entities, and only the "Pages" entity has the "mimeType" property set, so the test now looks for that entity before asserting that its properties contain the expected values.
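The entity lookup in the updated test can be pictured with a small sketch. Plain dicts stand in for the emitted entities here; the `find_pages_entity` helper and the dict layout are hypothetical and are not the followthemoney API or the actual test code.

```python
from typing import Dict, Iterable, List

# Stand-in for an emitted entity: a schema name plus multi-valued properties.
Entity = Dict[str, object]


def find_pages_entity(entities: Iterable[Entity]) -> Entity:
    """Return the single "Pages" entity, ignoring the per-page "Page" entities."""
    matches: List[Entity] = [e for e in entities if e["schema"] == "Pages"]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one Pages entity, found {len(matches)}")
    return matches[0]
```

A test would then assert on the properties of the returned entity only, rather than on whichever entity happens to be emitted first.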

tillprochaska commented 5 months ago

Just for future reference, not necessarily something we need to handle right now. But I found this issue in the libtiff repo which might describe a similar issue: https://gitlab.com/libtiff/libtiff/-/issues/13

When running a JPEG-compressed TIFF file through the tiff2pdf tool (which basically just changes the TIFF wrapper for a PDF wrapper since PDF can have JPEG-compressed image data in it), the resulting PDF file is not viewable in Acrobat Reader, evince, or Ghostscript, although xpdf does handle it fine.

If I understand the referenced issue correctly, it may have been fixed in a recent libtiff version. The version we’re using is quite old (4.1.0, released in 2019).