The script identifies whether the given PDF is structured (text-based) or scanned. If it is a text-based PDF, it uses the pdftotext tool to extract the text content and saves the pages in the given folder. It also splits the PDF into individual single-page PDFs using pdftk.
Make sure that pdftotext, pdfinfo and pdftk are installed on your computer. pdftotext and pdfinfo are available in poppler-utils. pdftk has to be installed separately.
Apparently pdftk can't be installed easily on Amazon Linux. However, there is a workaround.
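A quick way to confirm the prerequisites before running the script is to look each tool up on PATH. The snippet below is a minimal sketch and not part of the script itself; the helper name missing_tools is hypothetical.

```python
import shutil

# Command-line tools listed above as prerequisites.
REQUIRED_TOOLS = ("pdftotext", "pdfinfo", "pdftk")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools are installed.")
```

If pdftotext or pdfinfo is reported missing, installing poppler-utils should provide both.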
- Uses pdfinfo to get the total number of pages in the PDF, its size, and whether it is encrypted.
  - If the PDF is encrypted, it writes stats.json with { "status":"Encryption", .. }, throws an Exception, and exits from the script.
- Uses pdftotext to dump the text and checks the size of the extracted text content. If the text content averages 500 bytes per page, the PDF is considered structured; otherwise it is considered scanned.
- Uses pdftk to extract each PDF page and saves it in the pages folder.
- Uses pdftotext to extract the text content page by page and puts the txt files in the text folder.

TODO

stats.json file with the following content (status = [Scanned|Structured|Encrypted]):
{ "status": "Structured", "pages": 5 }
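The checks described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the function names are hypothetical, the 500-byte figure is treated as a minimum average per page, and the "Encrypted" status follows the status list above (the step list shows "Encryption"). The parser relies on the "Key: value" lines that pdfinfo prints, such as "Pages:" and "Encrypted:".

```python
import json

# Threshold from the description above; treating it as a minimum is an assumption.
MIN_AVG_BYTES_PER_PAGE = 500


def parse_pdfinfo(output):
    """Parse pdfinfo's "Key:   value" lines into a page count and an encryption flag."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "pages": int(fields.get("Pages", "0")),
        # pdfinfo prints "no" or "yes (print:yes ...)" for this field.
        "encrypted": fields.get("Encrypted", "no").lower().startswith("yes"),
    }


def classify(text, pages, threshold=MIN_AVG_BYTES_PER_PAGE):
    """Return "Structured" if the extracted text averages at least `threshold` bytes per page."""
    if pages <= 0:
        return "Scanned"
    average = len(text.encode("utf-8")) / pages
    return "Structured" if average >= threshold else "Scanned"


def build_stats(info, text):
    """Build the dictionary that would be serialized to stats.json."""
    if info["encrypted"]:
        status = "Encrypted"
    else:
        status = classify(text, info["pages"])
    return {"status": status, "pages": info["pages"]}


if __name__ == "__main__":
    sample_info = parse_pdfinfo("Pages:          5\nEncrypted:      no\n")
    print(json.dumps(build_stats(sample_info, "x" * 3000)))
```

In a real run, the pdfinfo output and the text dump would come from the tools themselves, for example via subprocess.run(["pdfinfo", path], capture_output=True, text=True) and subprocess.run(["pdftotext", path, "-"], capture_output=True, text=True), where the trailing "-" makes pdftotext write to standard output.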
Execute bash runtest.sh to run all of the above tests at once.
- Copy settings.config.bak to settings.config and update application-id and password.
- Run python run.py to see the options.
- python run.py -i tests/sample.pdf -o out -l french creates the folder out/text with the extracted text files, out/pages with the separated PDF files, and out/stats.json. For a French contract, it OCRs the document in that language. For now only English, French and Spanish are supported. The language is an optional field and defaults to English.
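For readers who want to see how such a command line could be parsed, here is a minimal argparse sketch. It is not the actual parser in run.py: only the short flags -i, -o and -l come from the example above, while the long option names, help strings, and the choice to make -i and -o required are assumptions.

```python
import argparse

# Languages the description above lists as supported.
SUPPORTED_LANGUAGES = ("english", "french", "spanish")


def build_parser():
    """Build a parser mirroring the -i / -o / -l options shown above."""
    parser = argparse.ArgumentParser(
        description="Extract pages and text from a PDF (illustrative sketch)."
    )
    parser.add_argument("-i", "--input", required=True, help="path to the input PDF")
    parser.add_argument("-o", "--output", required=True, help="output folder")
    parser.add_argument(
        "-l",
        "--language",
        choices=SUPPORTED_LANGUAGES,
        default="english",  # language is optional and falls back to english
        help="language used when OCR is needed",
    )
    return parser


if __name__ == "__main__":
    example = ["-i", "tests/sample.pdf", "-o", "out", "-l", "french"]
    args = build_parser().parse_args(example)
    print(args.input, args.output, args.language)
```

Restricting -l with choices means an unsupported language is rejected at parse time rather than partway through OCR.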