This could be streamlined somewhat by calling Tesseract from Python through a binding such as tesserocr or pyocr instead of using shell scripts.
Additionally, it would be great if there were a way to extract entities from a PDF without needing to run preprocess.sh to convert each page to an image and run tesseract on it.
Ghostscript - https://stackoverflow.com/a/36113000/1956065
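
As a rough illustration of the streamlining idea, the sketch below drives both steps from a single Python function: Ghostscript renders each page to a PNG, and tesserocr reads the text from each image. It assumes the `gs` executable is on the PATH and that the tesserocr package is installed. The function names `ghostscript_command` and `ocr_pdf`, the 300 dpi default, and the `page-%04d.png` naming are illustrative choices, not taken from preprocess.sh, whose exact steps may differ. Note that this still rasterizes every page, so it replaces the shell script rather than removing the image step.

```python
import subprocess
import tempfile
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG.

    Hypothetical helper: the name, the dpi default, and the output
    naming pattern are assumptions made for this sketch.
    """
    return [
        "gs", "-q", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",   # 24-bit colour PNG output
        f"-r{dpi}",          # rendering resolution in dpi
        f"-sOutputFile={Path(out_dir) / 'page-%04d.png'}",
        str(pdf_path),
    ]


def ocr_pdf(pdf_path, dpi=300):
    """Return the OCR text of each page of the PDF, in page order."""
    # Imported here so the command builder works without tesserocr installed.
    import tesserocr

    with tempfile.TemporaryDirectory() as tmp:
        # Render all pages into a temporary directory; raises on gs failure.
        subprocess.run(ghostscript_command(pdf_path, tmp, dpi), check=True)
        # Zero-padded names sort lexicographically in page order.
        pages = sorted(Path(tmp).glob("page-*.png"))
        return [tesserocr.file_to_text(str(page)) for page in pages]
```

Calling `ocr_pdf("scan.pdf")` (with a placeholder file name) would return one string per page, which could then be passed to the entity extraction step directly instead of being written to intermediate text files.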