Add OCR

This adds OCR via the tesseract tool.

This does OCR only on "garbage" PDFs. The "garbageness" is determined using a few tests with empirically set thresholds. The tested values are:

number of lines of text
average line length
total text file size
ratio of alphanumeric characters in text
number of lines where every second character is a space (ANSSI reports are bad in this way)

See the ocr.ipynb notebook for an exploration of these detectors.
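
As a rough illustration of how these detectors could fit together, here is a minimal sketch. The names (`TextStats`, `compute_stats`, `is_garbage`), the threshold values, and the direction of each comparison are assumptions made for this example; the actual thresholds were set empirically and are explored in the notebook.

```python
from dataclasses import dataclass


@dataclass
class TextStats:
    n_lines: int
    avg_line_len: float
    size: int             # size of the extracted text in bytes (UTF-8)
    alnum_ratio: float    # share of alphanumeric characters in the text
    spaced_lines: int     # lines where every second character is a space


def every_second_char_is_space(line: str) -> bool:
    """Detect lines like 'A N S S I   r e p o r t'."""
    stripped = line.strip()
    if len(stripped) < 5:  # too short to judge (hypothetical cut-off)
        return False
    return all(c == " " for c in stripped[1::2])


def compute_stats(text: str) -> TextStats:
    lines = text.splitlines()
    n_lines = len(lines)
    avg_line_len = sum(len(line) for line in lines) / n_lines if n_lines else 0.0
    alnum_ratio = sum(c.isalnum() for c in text) / len(text) if text else 0.0
    spaced_lines = sum(every_second_char_is_space(line) for line in lines)
    return TextStats(n_lines, avg_line_len, len(text.encode("utf-8")),
                     alnum_ratio, spaced_lines)


def is_garbage(stats: TextStats,
               min_lines: int = 30,           # hypothetical threshold
               min_avg_line_len: float = 10,  # hypothetical threshold
               min_size: int = 1000,          # hypothetical threshold
               min_alnum_ratio: float = 0.6,  # hypothetical threshold
               max_spaced_lines: int = 5      # hypothetical threshold
               ) -> bool:
    """Flag the text as garbage if any single detector trips."""
    return (stats.n_lines < min_lines
            or stats.avg_line_len < min_avg_line_len
            or stats.size < min_size
            or stats.alnum_ratio < min_alnum_ratio
            or stats.spaced_lines > max_spaced_lines)
```

Combining the detectors with a plain `or` is the simplest choice and errs toward running OCR, which only costs time when the PDF turns out to be fine.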

Generally, if we do OCR on a PDF that is not garbage, we lose little if anything in quality, as the OCR output is pretty good. The only thing we lose is time, as OCR is very slow.

I chose to avoid the ocrmypdf package, which could do a bunch of this for us, because it seemed to have a memory leak: when processing the hundreds of garbage PDF reports, the memory consumption of the process would grow to very large values.
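
For illustration, one way to drive tesseract directly is to run it as a short-lived subprocess per page, so any memory it uses is returned to the OS when each process exits. The sketch below is not necessarily how this PR does it: the helper names are hypothetical, and rasterizing pages with `pdftoppm` (from poppler-utils) is an assumption of this example. Both tools need to be installed for `ocr_pdf` to run.

```python
import subprocess
from pathlib import Path


def pdftoppm_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    # Render each PDF page to <prefix>-<page>.png at the given resolution.
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def tesseract_cmd(image: Path, lang: str = "eng") -> list[str]:
    # "stdout" as the output base makes tesseract print the recognized text.
    return ["tesseract", str(image), "stdout", "-l", lang]


def ocr_pdf(pdf: Path, workdir: Path, lang: str = "eng") -> str:
    """OCR a PDF page by page, one tesseract process per page."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_cmd(pdf, workdir / "page"), check=True)
    texts = []
    # pdftoppm zero-pads page numbers, so a lexicographic sort keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        result = subprocess.run(tesseract_cmd(image, lang), check=True,
                                capture_output=True, text=True)
        texts.append(result.stdout)
    return "\n".join(texts)
```

A call such as `ocr_pdf(Path("report.pdf"), Path("/tmp/ocr"))` would return the recognized text of a (hypothetical) `report.pdf` as a single string.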

crocs-muni / sec-certs

Add OCR #244