geramirez closed this 9 years ago
Took a look at pdftotext and Apache Tika. Tika is slower, but it seems that it's able to extract, on average, 0.5% more words.
Looking at what you've got in the analysis, I'm concerned about:
WORDS = re.compile('[A-Za-z]{3,}')
That allows for words that contain punctuation, right?
It shouldn't capture punctuation. So nothing in "it's" should match, but it would capture "people" if it's written as "people's".
Ah, that makes sense - and works. Thanks for figuring this out.
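For anyone who wants to check the behavior described above, here is a minimal sketch. It uses the same WORDS pattern from the analysis; the sample strings are made up for illustration and are not from the document set.

```python
import re

# Same pattern as in the analysis: runs of 3 or more ASCII letters.
WORDS = re.compile('[A-Za-z]{3,}')

# Illustrative inputs only (not taken from the real documents).
samples = ["it's", "people's", "re-enter the U.S. today"]

for sample in samples:
    # findall returns every non-overlapping run of 3+ letters;
    # apostrophes, hyphens, and periods simply split the runs.
    print(repr(sample), WORDS.findall(sample))

# 'it's'                     -> []
# 'people's'                 -> ['people']
# 're-enter the U.S. today'  -> ['enter', 'the', 'today']
```

Punctuation is never part of a match; it only acts as a separator, so short fragments on either side of it are dropped by the three-letter minimum.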
Some other things to add to this:
Are there any command line arguments to Apache Tika you'd recommend? Any special installation instructions?
Does Tika do a better job on some of those FBI documents that we know don't have a decent corresponding text extracted? Put another way, are our pdf to text tools better?
I try to answer these questions in another document. I've also been testing a couple other tools: pdf2txt.py, calibre, ghostscript, and tesseract.
@khandelwal Starting to put a document together here while running the script on the 200 doc sample and testing options.
I found the best way to run Tika is to start the server:
java -jar tika-app-1.7.jar --server --text --port 9998
then feed in documents using netcat and write the results to a text file:
nc localhost 9998 < document.pdf > document.txt
Without running the server, Tika will open the Java console for each document, slowing down the process.
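If netcat isn't convenient (for example, when looping over the sample from a script), the same exchange can be sketched in Python with a plain socket. This is only a sketch under assumptions: it assumes the server was started with the command above on port 9998, that it reads the document until the client closes its sending side, and that the returned text is UTF-8. The function name extract_text is made up for illustration.

```python
import socket


def extract_text(pdf_bytes, host="localhost", port=9998, timeout=60.0):
    """Send raw document bytes to a listening server and return its reply.

    Mirrors `nc localhost 9998 < document.pdf > document.txt`.
    Assumption: the reply is UTF-8 text; undecodable bytes are replaced.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(pdf_bytes)
        # Half-close the connection so the server sees end-of-input,
        # which is what netcat does when stdin reaches EOF.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Usage would look something like reading document.pdf in binary mode, calling extract_text on the bytes, and writing the result to document.txt.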
Tesseract is a little complicated to run with PDFs because it doesn't process PDF files natively. PDF files must first be converted into image files and then sent to Tesseract. For testing, I used this shell script to convert PDFs into images (using GS) and then into text.
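The shell script itself isn't reproduced in this thread, so the following is not that script, just a rough Python sketch of the same two-step idea: Ghostscript renders each page to a PNG, then Tesseract OCRs each image. The function names, the 300 dpi resolution, and the page-%03d.png naming are assumptions for illustration; it also assumes gs and tesseract are on the PATH.

```python
import subprocess
from pathlib import Path


def gs_command(pdf_path, out_pattern, dpi=300):
    """Ghostscript call that renders every page of a PDF to a PNG file."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m",
            f"-r{dpi}", f"-sOutputFile={out_pattern}", str(pdf_path)]


def tesseract_command(image_path, output_base):
    """Tesseract call; it writes its result to <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path, workdir):
    """Convert a PDF to page images, OCR each page, return the joined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # Step 1: one PNG per page (page-001.png, page-002.png, ...).
    subprocess.run(gs_command(pdf_path, workdir / "page-%03d.png"), check=True)
    pages = []
    # Step 2: OCR the pages in order and collect the text.
    for image in sorted(workdir.glob("page-*.png")):
        base = image.with_suffix("")  # e.g. workdir/page-001
        subprocess.run(tesseract_command(image, base), check=True)
        pages.append(Path(f"{base}.txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Keeping the command builders separate from the function that runs them makes it easy to try different resolutions or devices while testing options.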
Detecting responsive documents was more difficult than I originally thought. Occasionally, PDFs that have not been OCR'd will have some responsive text. Hence it will probably be important to test a "words extracted" threshold to trigger using Tesseract.
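A "words extracted" threshold could be sketched roughly as follows. The cutoff of 50 words per page is a placeholder, not a tested value, and needs_ocr is a made-up name; the word pattern is the same WORDS regex discussed earlier in the thread.

```python
import re

# Same word pattern used in the analysis: runs of 3 or more ASCII letters.
WORDS = re.compile('[A-Za-z]{3,}')


def needs_ocr(text, page_count=1, min_words_per_page=50):
    """Return True when the extracted text looks too sparse to trust.

    min_words_per_page is a placeholder threshold; the right value
    would have to come from testing against the document sample.
    """
    word_count = len(WORDS.findall(text))
    # Guard against a zero or negative page count.
    return word_count < min_words_per_page * max(page_count, 1)


# A page with only a stray header would be sent to Tesseract...
print(needs_ocr("FBI 1234"))        # True
# ...while a page with plenty of extracted words would not.
print(needs_ocr("word " * 100))     # False
```

Scaling the cutoff by page count is one possible design choice, so that a long scanned document with a few stray words per page is still flagged.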
Quick Analysis Here
Setup and research for pdf to text tools
Possible tools: pdftotext, Apache Tika, pdf2txt.py, calibre, Ghostscript, pdf2line, Tesseract & GS method, pdfbox