alephdata / ingest-file

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
GNU Affero General Public License v3.0

Release/3.20.1 #585

Closed stchris closed 6 months ago

stchris commented 6 months ago

A regression has affected the ability to OCR certain image types.

tesserocr, which ingest-file uses for OCR, has changed behaviour in a way that Aleph users have also noticed: JPEG images can no longer be OCRed. The cause is that the pre-compiled binary wheels are no longer built with support for JPEG, or for a few other file formats.

This PR forces ingest-file to build tesserocr instead of using the binary wheel, and adds a JPEG test that can catch the regression.
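As a rough illustration of both halves of this change, the sketch below builds a pip command that skips the binary wheel, and checks a version banner for JPEG support. The helper names `pip_source_build_command` and `has_image_library` are hypothetical and not taken from this PR. The check assumes the banner lists the linked image libraries by name (for example `libjpeg`), which is how `tesserocr.tesseract_version()` typically reports them. The actual PR may pin the source build differently, for instance in a requirements file or the Dockerfile.

```python
import sys
from typing import List


def pip_source_build_command(package: str) -> List[str]:
    """Build a pip invocation that compiles `package` from source.

    `--no-binary <package>` tells pip to ignore binary wheels for that package.
    """
    return [sys.executable, "-m", "pip", "install", "--no-binary", package, package]


def has_image_library(version_banner: str, library: str = "libjpeg") -> bool:
    """Return True if `library` is named in the version banner.

    Assumption: the banner names every image library that was linked in,
    so a missing name means the format is unsupported.
    """
    return library.lower() in version_banner.lower()
```

With tesserocr installed, `has_image_library(tesserocr.tesseract_version())` could serve as a quick smoke check alongside a test that actually runs OCR on a JPEG fixture.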

This PR also introduces the ingestors clear-cache command, which takes a prefix and deletes the matching ingest cache entries.
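The description does not spell out how the prefix is applied. Assuming it selects which cache keys are removed, the idea can be sketched with a plain dictionary standing in for the real cache backend. The function name `clear_cache` and the example keys are hypothetical and do not reflect the command's actual implementation.

```python
from typing import Dict


def clear_cache(store: Dict[str, str], prefix: str) -> int:
    """Delete every key in `store` that starts with `prefix`.

    Returns the number of entries removed. An empty prefix matches
    every key, so it clears the whole store.
    """
    # Collect the keys first so the dict is not mutated while iterating.
    doomed = [key for key in store if key.startswith(prefix)]
    for key in doomed:
        del store[key]
    return len(doomed)


# Hypothetical keys, for illustration only.
cache = {"ocr:abc": "text a", "ocr:def": "text b", "pdf:123": "text c"}
removed = clear_cache(cache, "ocr:")
```

After the call, only the `pdf:123` entry remains and `removed` is 2. Clearing stale OCR results this way would let documents affected by the regression be processed again rather than served from the cache.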