Open dev-ng opened 3 years ago
Inspired by: https://github.com/inidun/unesco_data_collection/issues/5
Tool | Licence | Requirements | Details | Result |
---|---|---|---|---|
pdfplumber | MIT | Built on pdfminer.six. | Extracts simple text. Fails on tables and multi-columns. | |
pdfbox | Apache License 2.0 | Java | Extracts text from multi-columns. Fails on many PDFs (possibly because of an encryption flag?). | |
tika | Apache License 2.0 | Java 7+ | Starts up a Tika REST server in the background. | Extracts text from multi-columns. Separates paragraphs with empty lines, but this separation is not always accurate. |
pdftotext | MIT | Tricky to install under Windows. Requires C++ build tools and poppler. | Extracts text from multi-columns. Extracts paragraphs and separates pages. | |
pdfminer.six | MIT | | Extracts multi-columns, separates paragraphs even better than pdftotext. Does pagination. Is slow. | |
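
Several rows above mention paragraph separation and pagination. As a rough sketch of how that output could be post-processed, the snippet below assumes the extracted text marks page breaks with a form-feed character (`\f`) and paragraph breaks with empty lines. That is an assumption about the extractor's output, not something verified for every tool in the table. The helper names `split_pages` and `split_paragraphs` are hypothetical.

```python
from __future__ import annotations

import re


def split_pages(text: str) -> list[str]:
    """Split extracted text into pages on form-feed characters (assumed page marker)."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty last chunk; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def split_paragraphs(page: str) -> list[str]:
    """Split one page into paragraphs on blank lines and unwrap line breaks."""
    chunks = re.split(r"\n\s*\n", page)
    # Collapse internal whitespace so a wrapped paragraph becomes a single line.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


# Made-up sample standing in for real extractor output.
sample = "First paragraph,\nwrapped.\n\nSecond paragraph.\fPage two text.\n\f"
pages = split_pages(sample)
paragraphs = [split_paragraphs(page) for page in pages]
```

Since the tika row notes that empty-line separation is not always accurate, the paragraph split here should be treated as a heuristic and checked against the source PDFs.
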
New requirements: