anjesh / pdf-processor

Processes a PDF, determines whether it is text based or not, and extracts data from it in either case
MIT License

Scanned vs Structured PDF Processor

The script identifies whether the given PDF is structured (text based) or scanned. If the PDF is text based, the script uses the pdftotext tool to extract the text content and saves the pages in the given folder. It also splits the PDF into individual single-page PDFs using pdftk.
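The flow described above can be sketched in Python roughly as follows. This is a minimal illustration, not the repository's actual code: the function names, the min_chars threshold, the "text on any page" heuristic, and the page_NNNN file naming are all assumptions made for the example. It also assumes the split into single pages happens for both kinds of PDF. Only standard invocations of pdfinfo, pdftotext (-f/-l page range, "-" for stdout) and pdftk (burst) are used.

```python
import re
import subprocess
from pathlib import Path


def parse_page_count(pdfinfo_output: str) -> int:
    """Read the 'Pages:' field from pdfinfo's plain-text output."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def looks_structured(page_texts, min_chars=20):
    """Hypothetical heuristic: treat the PDF as text based if any page
    yields at least min_chars non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_text(pdf_path, page):
    """Extract one page with pdftotext, writing the text to stdout ('-')."""
    result = subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), str(pdf_path), "-"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def process_pdf(pdf_path, out_dir):
    """Classify the PDF, save per-page text if it is text based,
    and split it into single-page PDFs. Returns True if text based."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    info = subprocess.run(
        ["pdfinfo", str(pdf_path)],
        capture_output=True, encoding="utf-8", check=True,
    ).stdout
    page_count = parse_page_count(info)

    texts = [extract_page_text(pdf_path, n) for n in range(1, page_count + 1)]
    structured = looks_structured(texts)

    if structured:
        # Assumed naming scheme: page_0001.txt, page_0002.txt, ...
        for n, text in enumerate(texts, start=1):
            (out / f"page_{n:04d}.txt").write_text(text, encoding="utf-8")

    # pdftk's burst operation writes one PDF per page using a printf-style pattern.
    subprocess.run(
        ["pdftk", str(pdf_path), "burst", "output", str(out / "page_%04d.pdf")],
        check=True,
    )
    return structured
```

Calling process_pdf("input.pdf", "out") would return True for a text based PDF and False for a scanned one. The threshold is a guess and may need tuning for PDFs that contain only a few characters of real text.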

Prerequisites

Make sure that pdftotext, pdfinfo and pdftk are installed on your computer. pdftotext and pdfinfo are part of poppler-utils; pdftk has to be installed separately.
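A quick way to verify the prerequisites is to check that each tool is on the PATH. The helper below is a small sketch and not part of the repository; the function name missing_tools is made up for illustration.

```python
import shutil

REQUIRED_TOOLS = ("pdftotext", "pdfinfo", "pdftk")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```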

Installing pdftk in Amazon Linux

Apparently pdftk can't be installed easily on Amazon Linux. However, there's a workaround.

How it works

Test

Execute bash runtest.sh to run all of the above tests at once.

Run

TODO