WZBSocialScienceCenter / pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
Apache License 2.0

Using pdftabextract to convert a PDF that was created from a picture #7

Closed by CapitaineNemo 6 years ago

CapitaineNemo commented 6 years ago

Hi, I tried to convert a PDF to Excel, but it failed. The table is in a picture: I converted the picture into a PDF and then ran the code on it, and the conversion failed. Could you tell me what kind of PDF can be converted?

internaut commented 6 years ago

pdftabextract is not OCR software; it is a tool that helps extract tabular information from PDFs that have already been OCR-processed. A PDF made from a picture contains only an image and no text, so you need to run OCR software such as ABBYY FineReader first to recognize the text on your page(s).
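
One way to check whether a PDF already has recognized text is to convert it to XML (for example with `pdftohtml -xml` from poppler-utils) and look for `<text>` elements on each page. The sketch below is only an illustration and not part of pdftabextract: the function name `pages_without_text` and the sample XML are made up for this example, and it assumes the usual `<page>`/`<text>` layout of that XML output.

```python
import xml.etree.ElementTree as ET


def pages_without_text(xml_string):
    """Return the numbers of pages that contain no non-empty <text> element.

    Hypothetical helper: assumes XML shaped like `pdftohtml -xml` output.
    """
    root = ET.fromstring(xml_string)
    missing = []
    for page in root.iter("page"):
        # itertext() also collects text nested in tags such as <b> or <i>
        has_text = any("".join(t.itertext()).strip() for t in page.iter("text"))
        if not has_text:
            missing.append(page.get("number"))
    return missing


# Made-up sample: page 1 holds only an image, page 2 holds a text box.
sample = """<pdf2xml>
<page number="1" width="100" height="100"><image top="0" left="0" width="100" height="100" src="p1.png"/></page>
<page number="2" width="100" height="100"><text top="10" left="10" width="30" height="8" font="0">Total</text></page>
</pdf2xml>"""

print(pages_without_text(sample))
```

Pages reported by this check have no text layer and would need to go through OCR before table extraction can work on them.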