bizres / report-text-extraction

BSD 3-Clause "New" or "Revised" License

PDF to text extraction quality #4

Open dev-ng opened 2 years ago

dev-ng commented 2 years ago

Inspired by: https://github.com/inidun/unesco_data_collection/issues/5

| Tool | Licence | Requirements | Details | Result |
| --- | --- | --- | --- | --- |
| pdfplumber | MIT | | Built on pdfminer.six. | Extracts simple text. Fails on tables and multi-column layouts. |
| pdfbox | Apache License 2.0 | Java | | Extracts text from multi-column layouts. Fails on many PDFs (possibly because of the encryption flag?). |
| tika | Apache License 2.0 | Java 7+ | Starts up a Tika REST server in the background. | Extracts text from multi-column layouts. Separates paragraphs with empty lines, but this separation is not always accurate. |
| pdftotext | MIT | Requires C++ build tools and poppler. | Tricky to install under Windows. | Extracts text from multi-column layouts. Extracts paragraphs and separates pages. |
| pdfminer.six | MIT | | | Extracts multi-column layouts and separates paragraphs even better than pdftotext. Does pagination. Is slow. |
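
To compare the paragraph and page separation noted in the Result column, the raw text from each tool can be split the same way. The following is a minimal sketch, not code from this repository. It assumes the extractor marks page ends with a form feed (`\f`), which pdftotext and pdfminer.six do by default as far as I know, and that paragraphs are separated by blank lines. The helper names `split_pages`, `split_paragraphs` and `extract_text_pdfminer` are hypothetical; only `pdfminer.high_level.extract_text` is a real library call, imported lazily so the rest runs without pdfminer.six installed.

```python
import re


def split_pages(text: str) -> list[str]:
    """Split extracted text into pages on the form-feed character."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def split_paragraphs(page: str) -> list[str]:
    """Split one page into paragraphs on blank lines and unwrap line breaks."""
    blocks = re.split(r"\n\s*\n", page)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def extract_text_pdfminer(path: str) -> str:
    """Extract text with pdfminer.six (assumes `pip install pdfminer.six`)."""
    from pdfminer.high_level import extract_text  # lazy import, optional dependency
    return extract_text(path)


# Hand-written sample standing in for real extractor output (assumption).
sample = "Annual report\n\nRevenue grew\nin 2020.\n\fNotes\n\nSee appendix.\n\f"
pages = split_pages(sample)
paragraphs = [split_paragraphs(page) for page in pages]
print(paragraphs)
```

Running the same two helpers over the output of each tool gives a like-for-like paragraph count per page, which makes it easier to spot where a tool merges or over-splits paragraphs.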
dev-ng commented 2 years ago

New requirements: