UW-xDD / blackstack

Entity extraction from PDFs with Tesseract and Machine Learning
MIT License

Decrease reliance on non-Python APIs #2

Open jczaplew opened 6 years ago

jczaplew commented 6 years ago

This could be streamlined somewhat by replacing the shell scripts with a Python OCR binding such as tesserocr or pyocr.
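As a rough sketch of what that might look like, assuming tesserocr is installed and the page images already exist on disk. `ocr_pages` and `_default_ocr` are hypothetical names for illustration, not part of blackstack; the only tesserocr call used is `file_to_text`.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def _default_ocr(image_path: str) -> str:
    # Imported lazily so this module still loads where tesserocr is absent.
    import tesserocr

    return tesserocr.file_to_text(image_path)


def ocr_pages(
    image_paths: Iterable[str],
    ocr: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Run OCR on each page image and return {file name: recognized text}.

    `ocr` is any callable taking an image path and returning text; it
    defaults to tesserocr but could be swapped for a pyocr-based function.
    """
    run = ocr or _default_ocr
    return {Path(p).name: run(str(p)) for p in image_paths}
```

Keeping the OCR backend behind a plain callable would make it easy to try tesserocr and pyocr side by side before committing to either.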

Additionally, it would be great to be able to extract entities from a PDF without first running preprocess.sh, which converts each page to an image and runs tesseract on it.

Ghostscript - https://stackoverflow.com/a/36113000/1956065
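One possible shape for the Ghostscript route, sketched under assumptions: it presumes the `gs` binary is on the PATH, so it would replace the shell script but not the external dependency. `ghostscript_args` and `rasterize` are hypothetical helper names, and the output pattern and resolution are illustrative defaults rather than what preprocess.sh uses.

```python
import subprocess
from typing import List


def ghostscript_args(
    pdf_path: str,
    out_pattern: str = "page_%03d.png",
    dpi: int = 300,
) -> List[str]:
    """Build a Ghostscript command that renders each PDF page to a PNG."""
    return [
        "gs",
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit once the input is processed
        "-dSAFER",            # restrict file access from the PDF
        "-sDEVICE=png16m",    # 24-bit colour PNG output
        f"-r{dpi}",           # rendering resolution in DPI
        f"-sOutputFile={out_pattern}",  # %03d expands to the page number
        pdf_path,
    ]


def rasterize(pdf_path: str, out_pattern: str = "page_%03d.png", dpi: int = 300) -> None:
    # Raises CalledProcessError if Ghostscript exits with a non-zero status.
    subprocess.run(ghostscript_args(pdf_path, out_pattern, dpi), check=True)
```

The resulting images could then be handed to whichever Python OCR binding is chosen, without going through a shell script in between.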