Decrease reliance on non-Python APIs

This could be streamlined somewhat by using something like tesserocr or pyocr instead of using shell scripts.

Additionally, it would be great if there were a way to extract entities from a PDF without needing to run preprocess.sh to convert each page to an image and run tesseract on it.