crocs-muni / sec-certs

Tool for analysis of security certificates and their security targets (Common Criteria, NIST FIPS140-2...).
https://sec-certs.org
MIT License

Add OCR #244

Closed J08nY closed 2 years ago

J08nY commented 2 years ago

This adds OCR via the tesseract tool.

OCR is run only on "garbage" PDFs. The "garbageness" of a PDF is determined using a few tests with empirically set thresholds. The tested values are:

Generally, if we run OCR on a PDF that is not garbage, we lose little, if anything, because the OCR output is pretty good. The only real cost is time, as OCR takes ages.
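To illustrate the idea of threshold-based garbage detection, here is a minimal sketch. The metrics (text length, share of alphabetic characters, average line length), the names `GarbageThresholds` and `looks_like_garbage`, and all threshold values are assumptions made for this example. They are not the tests or thresholds used in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GarbageThresholds:
    # Hypothetical values for illustration only, not the PR's thresholds.
    min_chars: int = 1000
    min_alpha_ratio: float = 0.5
    min_avg_line_len: float = 10.0


def looks_like_garbage(text: str, t: GarbageThresholds = GarbageThresholds()) -> bool:
    """Return True if the extracted text is probably unusable and worth OCR-ing."""
    stripped = text.strip()
    # Too little text (or none at all) counts as garbage.
    if not stripped or len(stripped) < t.min_chars:
        return True
    # Share of alphabetic characters among all non-whitespace characters.
    non_space = [c for c in stripped if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    # Average length of the non-empty lines.
    lines = [ln for ln in stripped.splitlines() if ln.strip()]
    avg_line_len = sum(len(ln) for ln in lines) / len(lines)
    return alpha_ratio < t.min_alpha_ratio or avg_line_len < t.min_avg_line_len
```

Since a false positive only costs OCR time, thresholds in a check like this can lean towards flagging a PDF as garbage.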

I chose to avoid the ocrmypdf package, which could do a bunch of this work for us, because it seemed to have a memory leak: when processing the hundreds of garbage PDF reports, the memory consumption of the process would grow to very large values.
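One way to keep memory bounded is to call the tesseract command-line tool in a separate short-lived process per page, so that whatever each call allocates is returned to the OS when it exits. The sketch below shows that approach and is not necessarily how this PR implements it. Rendering pages with poppler's `pdftoppm`, the function names `run_command` and `ocr_pdf`, and the default language and DPI are assumptions for this example.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_command(cmd: Sequence[str]) -> str:
    """Run an external command and return its stdout as text."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def ocr_pdf(pdf_path: Path, lang: str = "eng", dpi: int = 300, run: Runner = run_command) -> str:
    """OCR a PDF page by page, one external process per step."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Assumption: pdftoppm (poppler) renders each page to <prefix>-<n>.png.
        run(["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)])
        # Sort by the numeric page suffix so page 10 comes after page 2.
        pages = sorted(
            Path(tmp).glob("page-*.png"),
            key=lambda p: int(p.stem.rsplit("-", 1)[1]),
        )
        # "stdout" as the output base makes tesseract print the recognized text.
        texts = [run(["tesseract", str(png), "stdout", "-l", lang]) for png in pages]
    return "\n".join(texts)
```

The `run` parameter exists only so the pipeline can be exercised without the external tools installed. In normal use, `ocr_pdf(Path("report.pdf"))` would shell out to `pdftoppm` and `tesseract`.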