WZBSocialScienceCenter / pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
Apache License 2.0

Please help me. #15

Closed eaglecoder1023 closed 6 years ago

eaglecoder1023 commented 6 years ago

I want this table to be recognized using pdftabextract (attached image: roi_1_). I converted it to a searchable PDF using tesseract and followed every step in this tutorial: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/ I got this result (attached file: output.xlsx). I want to improve the result so that it matches the image as closely as possible. Please help me.

internaut commented 6 years ago

Closed. Please do not abuse the issue tracker for such requests.

Your table is well structured enough to be parsed without pdftabextract. Simply use tesseract for the OCR and then pdftotext (with the -layout parameter) from poppler-utils for converting it to plain text. Write a simple Python script to parse the text.
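
A minimal sketch of what such a script could look like, assuming the `pdftotext -layout` output separates table columns by runs of two or more spaces. The function names `pdf_to_layout_text` and `parse_layout_table` and the `min_gap` parameter are made up for illustration and are not part of pdftabextract or poppler-utils:

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run poppler's pdftotext with -layout; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_layout_table(text, min_gap=2):
    """Split every non-empty line into cells at runs of >= min_gap spaces.

    Assumption: columns are separated by at least min_gap spaces and
    single spaces only occur inside a cell.
    """
    splitter = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(splitter.split(line))
    return rows


if __name__ == "__main__":
    import sys

    # Usage: python parse_table.py searchable.pdf
    for row in parse_layout_table(pdf_to_layout_text(sys.argv[1])):
        print(row)
```

Note that this simple split shifts cells to the left when a row has an empty cell. If that happens with your table, slicing each line at fixed character offsets taken from the header row may work better.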