18F / 2015-foia-hub

A consolidated FOIA request hub.

Research PDF text extraction tools #473

Closed geramirez closed 9 years ago

geramirez commented 9 years ago

Quick Analysis Here

Setup and research for pdf to text tools

Possible tools:

  - pdftotext
  - Apache Tika
  - pdf2txt.py
  - calibre
  - Ghostscript
  - pdf2line
  - Tesseract & GS method
  - pdfbox

geramirez commented 9 years ago

Took a look at pdftotext and Apache Tika. Tika is slower, but it seems that it's able to extract, on average, 0.5% more words.

Quick Analysis Here
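The "0.5% more words" comparison can be sketched as counting regex-matched words in each tool's output. This is a guess at the metric; the linked analysis has the actual method, and `word_gain` is a hypothetical helper name:

```python
import re

# Same word pattern discussed later in this thread.
WORDS = re.compile('[A-Za-z]{3,}')

def word_gain(tika_text, pdftotext_text):
    """Percent more words Tika extracted relative to pdftotext."""
    tika_n = len(WORDS.findall(tika_text))
    pdft_n = len(WORDS.findall(pdftotext_text))
    return 100.0 * (tika_n - pdft_n) / pdft_n
```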

khandelwal commented 9 years ago

Looking at what you've got in the analysis, I'm concerned about:

`WORDS = re.compile('[A-Za-z]{3,}')`

That allows for words that contain punctuation, right?

geramirez commented 9 years ago

It shouldn't capture punctuation. The pattern only matches runs of three or more letters, so nothing in it's should match, but it would capture people if it's written as people's.
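The behavior described above can be checked directly with the pattern from the analysis:

```python
import re

# The pattern from the analysis: runs of three or more ASCII letters.
WORDS = re.compile('[A-Za-z]{3,}')

# "it's" has no run of 3+ letters ("it" is two, "s" is one), so nothing matches;
# "people's" yields "people" but not the trailing "s".
print(WORDS.findall("it's"))      # []
print(WORDS.findall("people's"))  # ['people']
```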

khandelwal commented 9 years ago

Ah, that makes sense - and works. Thanks for figuring this out.

khandelwal commented 9 years ago

Some other things to add to this:

Are there any command line arguments to Apache Tika you'd recommend? Any special installation instructions?

khandelwal commented 9 years ago

Does Tika do a better job on some of those FBI documents that we know don't have a decent corresponding text extracted? Put another way, are our pdf to text tools better?

geramirez commented 9 years ago

I'll try to answer these questions in another document. I've also been testing a couple of other tools: pdf2txt.py, calibre, Ghostscript, and Tesseract.

geramirez commented 9 years ago

@khandelwal Starting to put a document together here while running the script on the 200 doc sample and testing options.

geramirez commented 9 years ago

Synthesis

  1. pdf2txt.py and Tika extract the most words; however, Tika can also deal with different types of documents, so our best option seems to be Tika.
  2. Tesseract can be used to extract text from documents that haven't been OCRed.

Moving forward

  1. We need to build a script that would allow us to use both Tika and Tesseract optimally, i.e., when the document contains fewer than 100 words we run it through Tesseract, etc.
  2. Integrate the script above into the document scrapers.
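The dispatch logic in step 1 can be sketched as below. The 100-word threshold is the example figure from the comment, and the two extractor functions are passed in as parameters rather than being real Tika/Tesseract bindings:

```python
import re

# Same word pattern used in the analysis.
WORDS = re.compile('[A-Za-z]{3,}')
WORD_THRESHOLD = 100

def extract_text(pdf_path, tika_extract, tesseract_extract,
                 threshold=WORD_THRESHOLD):
    """Return extracted text, preferring Tika and falling back to OCR."""
    text = tika_extract(pdf_path)
    if len(WORDS.findall(text)) < threshold:
        # Too few real words: the PDF probably lacks a text layer, so OCR it.
        text = tesseract_extract(pdf_path)
    return text
```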

geramirez commented 9 years ago

Details

Tika

I found the best way to run Tika is to start the server:

`java -jar tika-app-1.7.jar --server --text --port 9998`

then feed in documents using netcat and write the results to a text file:

`nc localhost 9998 < document.pdf > document.txt`

Without running the server, Tika launches a new Java process for each document, slowing down the process.
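The netcat step could also be done in Python with a raw socket, which is handy inside a scraper. This is a sketch assuming the same Tika server socket protocol nc relies on (send the document bytes, signal EOF, read the text back); `tika_extract` is a hypothetical helper name:

```python
import socket

def tika_extract(pdf_path, host='localhost', port=9998):
    """Mirror `nc localhost 9998 < document.pdf > document.txt`."""
    with socket.create_connection((host, port)) as sock:
        with open(pdf_path, 'rb') as f:
            sock.sendall(f.read())
        sock.shutdown(socket.SHUT_WR)  # signal EOF, like nc hitting end of file
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b''.join(chunks).decode('utf-8', errors='replace')
```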

Tesseract

Tesseract is a little complicated to run with PDFs because it doesn't inherently process PDF files. PDF files must first be converted into an image file and then sent to Tesseract. For testing, I used this shell script to convert PDFs into images (using Ghostscript) and then into text.
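The two-step pipeline above can be sketched in Python as follows. The Ghostscript and Tesseract flags are common choices, not the exact ones from the testing script, and `ocr_pdf` is a hypothetical helper; both binaries must be on `PATH` for the last function to run:

```python
import subprocess

def gs_command(pdf_path, tiff_path, dpi=300):
    # Render the PDF to a multi-page TIFF that Tesseract can read.
    return ['gs', '-dNOPAUSE', '-dBATCH', '-sDEVICE=tiffg4',
            '-r%d' % dpi, '-sOutputFile=%s' % tiff_path, pdf_path]

def tesseract_command(tiff_path, out_base):
    # Tesseract writes its output to <out_base>.txt
    return ['tesseract', tiff_path, out_base]

def ocr_pdf(pdf_path, work_base):
    """PDF -> TIFF (Ghostscript) -> text (Tesseract)."""
    tiff = work_base + '.tif'
    subprocess.check_call(gs_command(pdf_path, tiff))
    subprocess.check_call(tesseract_command(tiff, work_base))
    with open(work_base + '.txt') as f:
        return f.read()
```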

Detecting OCRed documents

Detecting responsive documents was more difficult than I originally thought. Occasionally, PDFs which have not been OCRed will still return some responsive text. Hence, it will probably be important to test a "words extracted" threshold to trigger using Tesseract.