This does OCR only on "garbage" PDFs; the "garbageness" is determined using a few tests with empirically set thresholds. The tested values are:
number of lines of text
average line length
total text file size
ratio of alphanumeric characters in text
number of lines where every second character is a space (ANSSI reports are bad in this way).
See the ocr.ipynb notebook for an exploration of these detectors.
Generally, if we run OCR on a PDF that is not garbage, we lose little if anything, as the OCR output is pretty good. The only thing we lose is time, as OCR takes ages.
I chose to avoid the ocrmypdf package, which could do a bunch of this for us, because it seemed to have a memory leak: when processing the hundreds of garbage PDF reports, the memory consumption of the process would grow to very large values.
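As a rough illustration of the detectors listed above, here is a minimal sketch. The function names and all threshold values are hypothetical placeholders, not the values used in this PR (those were set empirically, see the notebook). The alphanumeric ratio is assumed to be taken over all characters of the text.

```python
from dataclasses import dataclass

# Hypothetical placeholder thresholds, NOT the empirically set ones from the PR.
MIN_LINES = 30
MIN_AVG_LINE_LEN = 10.0
MIN_SIZE_BYTES = 1000
MIN_ALNUM_RATIO = 0.5
MAX_SPACED_LINES = 5


@dataclass
class TextStats:
    n_lines: int
    avg_line_len: float
    size_bytes: int
    alnum_ratio: float
    n_spaced_lines: int


def every_second_char_is_space(line: str) -> bool:
    """True for lines such as 'A N S S I', where every second character is a space."""
    stripped = line.strip()
    if len(stripped) < 5:
        return False  # too short to judge
    return all(ch == " " for ch in stripped[1::2])


def text_stats(text: str) -> TextStats:
    """Compute the five tested values on the text extracted from a PDF."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_lines = len(lines)
    avg_line_len = sum(len(line) for line in lines) / n_lines if n_lines else 0.0
    alnum_ratio = sum(ch.isalnum() for ch in text) / len(text) if text else 0.0
    n_spaced = sum(every_second_char_is_space(line) for line in lines)
    return TextStats(n_lines, avg_line_len, len(text.encode("utf-8")), alnum_ratio, n_spaced)


def is_garbage(text: str) -> bool:
    """Flag the text as garbage if any single test fails."""
    s = text_stats(text)
    return (
        s.n_lines < MIN_LINES
        or s.avg_line_len < MIN_AVG_LINE_LEN
        or s.size_bytes < MIN_SIZE_BYTES
        or s.alnum_ratio < MIN_ALNUM_RATIO
        or s.n_spaced_lines > MAX_SPACED_LINES
    )
```

Treating a single failed test as enough to trigger OCR is a design assumption of this sketch; it matches the observation that a false positive only costs time.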
The OCR itself is done via the tesseract tool.
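For illustration, one way to drive tesseract from Python without ocrmypdf is to call the command-line tools as subprocesses. This is only a sketch under assumptions: the helper names are hypothetical, and rasterizing pages with pdftoppm (from poppler-utils) is an assumption of this example, not necessarily what the PR does.

```python
import subprocess
from pathlib import Path
from typing import List


def pdftoppm_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> List[str]:
    """Command that renders each PDF page to <out_prefix>-<N>.png (assumed rasterizer)."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def tesseract_command(image: Path, lang: str = "eng") -> List[str]:
    """Command that OCRs one image; the output base 'stdout' prints the text."""
    return ["tesseract", str(image), "stdout", "-l", lang]


def ocr_pdf(pdf: Path, workdir: Path, lang: str = "eng") -> str:
    """OCR a whole PDF page by page and return the concatenated text."""
    prefix = workdir / "page"
    subprocess.run(pdftoppm_command(pdf, prefix), check=True)
    texts = []
    # pdftoppm zero-pads page numbers, so lexicographic order is page order.
    for image in sorted(workdir.glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True,
            capture_output=True,
            text=True,
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```

Each page is handled by a short-lived child process, so any memory the tools use is released when they exit, which should keep the parent process from growing while it works through hundreds of reports.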