Fix TIFF processing. Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp

This PR addresses the fact that some TIFF images were not being OCR'ed correctly.

The cause: when a TIFF file contained JPEG-compressed data, the tiff2pdf command, as invoked before this commit, generated a PDF containing an empty image.

To address this, we have added the -n and -j flags to the tiff2pdf command:

the -n flag results in the JPEG-compressed data actually being written to the PDF (the tiff2pdf man page describes this flag as "no compressed data passthrough", which suggests the image data is decoded and re-encoded rather than copied through as-is)
the -j flag sets JPEG as the compression type, which keeps the resulting PDF from being blown up in size

However, TIFF images that do not contain JPEG-compressed data are not converted correctly to PDF when the -j flag is used: the command errors out. This PR therefore first runs tiff2pdf with the -n and -j flags and, if that fails, retries without them. The reverse order would not work: when the TIFF contains JPEG-compressed data, tiff2pdf without the flags does not fail but instead produces an empty image inside a PDF, so there would be no error to trigger the fallback.
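The ordering of the two attempts can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `build_commands` and `convert_tiff_to_pdf` and the injectable `run` parameter are hypothetical, and the sketch relies only on the exit code of tiff2pdf to decide whether to fall back.

```python
import subprocess
from typing import Callable, List


def build_commands(tiff_path: str, pdf_path: str) -> List[List[str]]:
    """Return the tiff2pdf invocations to try, in order (hypothetical helper)."""
    tail = ["-o", pdf_path, tiff_path]
    return [
        # First attempt: -n and -j, needed for JPEG-compressed TIFF data.
        ["tiff2pdf", "-n", "-j"] + tail,
        # Fallback: plain invocation for TIFFs where -j makes the command fail.
        ["tiff2pdf"] + tail,
    ]


def convert_tiff_to_pdf(
    tiff_path: str,
    pdf_path: str,
    run: Callable = subprocess.run,  # injectable so the logic can be tested
) -> List[str]:
    """Try each command in order; return the first one that exits with 0."""
    for cmd in build_commands(tiff_path, pdf_path):
        result = run(cmd, capture_output=True)
        if result.returncode == 0:
            return cmd
    raise RuntimeError(f"tiff2pdf failed for {tiff_path}")
```

The flagged command has to come first because a non-zero exit code is the only signal used here. The plain command exits successfully on JPEG-compressed input even though the image in the PDF is empty, so it could never trigger a retry.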

This PR also adds a test that specifically guards against a regression in the OCR behaviour described in this PR. It also updates the existing TIFF parsing test: TIFF images get converted into one "Pages" entity with several "Page" entities, and only the "Pages" entity has the "mimeType" property set. The test therefore looks for that entity first and then asserts that its properties contain the expected values.
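The shape of the updated assertion can be illustrated with plain dictionaries. This is only a sketch under assumptions: the real test works with the project's own entity objects, while the `find_entity_by_schema` helper and the `schema` / `properties` dictionary layout below are hypothetical stand-ins.

```python
from typing import Iterable, Optional


def find_entity_by_schema(entities: Iterable[dict], schema: str) -> Optional[dict]:
    """Return the first entity whose schema matches, or None (hypothetical helper)."""
    for entity in entities:
        if entity.get("schema") == schema:
            return entity
    return None


# Hypothetical result of ingesting a two-page TIFF: only "Pages" carries mimeType.
entities = [
    {"schema": "Page", "properties": {"index": ["1"]}},
    {"schema": "Page", "properties": {"index": ["2"]}},
    {"schema": "Pages", "properties": {"mimeType": ["image/tiff"]}},
]

pages = find_entity_by_schema(entities, "Pages")
assert pages is not None
assert pages["properties"]["mimeType"] == ["image/tiff"]
```

Selecting the "Pages" entity explicitly avoids depending on the order in which entities are emitted, which is why asserting on an arbitrary entity would not work.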

alephdata / ingest-file

Fix TIFF processing. Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp #587