pdftotext (xpdf/poppler) works but 'pdf2txt.py -t html' does not

euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

https://github.com/pdfminer/pdfminer.six

MIT License


pdftotext (xpdf/poppler) works but 'pdf2txt.py -t html' does not #46

Open umrashrf opened 10 years ago

umrashrf commented 10 years ago

I have a PDF file from which poppler's pdftotext extracts the whole text, but PDFMiner's pdf2txt fails to extract all of it.

pdftotext does print the following error, but it still extracts the whole text:

Error: PDF file is damaged - attempting to reconstruct xref table...

It looks like xpdf has some ability to reconstruct the xref table and PDFMiner doesn't.

euske commented 10 years ago

Could you upload or send me the PDF in question?

umrashrf commented 10 years ago

Sure, I can. Where should I send it? I would rather not upload it here, though.

euske commented 10 years ago

Hi, thanks for the PDF. I looked into it and found that the missing text is not actually part of the page content; it is implemented as part of an Acrobat form. The problem wasn't a malformed PDF. Right now, pdfminer doesn't support extracting text from forms. It shouldn't be that hard, though, so I will try to add that feature in the future.
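
For anyone who needs the form values in the meantime, here is a rough sketch of how they could be read through pdfminer's low-level objects. It assumes the text lives in the document catalog's AcroForm dictionary, with field names under T, values under V, and child fields under Kids. The helper names (decode_pdf_text, collect_fields, read_form_fields) are made up for illustration and are not part of pdfminer, and the Latin-1 fallback is only an approximation of PDFDocEncoding.

```python
def decode_pdf_text(value):
    """Turn a raw PDF string into str; other values are stringified as-is."""
    if isinstance(value, bytes):
        # A UTF-16BE byte order mark signals a Unicode text string.
        if value.startswith(b"\xfe\xff"):
            return value[2:].decode("utf-16-be", errors="replace")
        # Assumption: Latin-1 is close enough to PDFDocEncoding for a sketch.
        return value.decode("latin-1")
    return None if value is None else str(value)


def collect_fields(fields, resolve=lambda obj: obj, prefix=""):
    """Recursively gather (qualified name, value) pairs from a field tree.

    `resolve` dereferences indirect objects; pass pdfminer's resolve1
    when walking a real document.
    """
    found = []
    for ref in fields or []:
        field = resolve(ref)
        name = decode_pdf_text(field.get("T"))
        full_name = ".".join(part for part in (prefix, name) if part)
        if "V" in field:
            found.append((full_name, decode_pdf_text(resolve(field["V"]))))
        # Child fields inherit the parent's name as a prefix.
        found.extend(collect_fields(resolve(field.get("Kids")), resolve, full_name))
    return found


def read_form_fields(path):
    """Return the form field values of the PDF at `path` (hypothetical helper)."""
    # Imported lazily so the helpers above work without pdfminer installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        acroform = resolve1(doc.catalog.get("AcroForm"))
        if not acroform:
            return []  # The document has no interactive form.
        return collect_fields(resolve1(acroform.get("Fields")), resolve1)
```

Calling read_form_fields on a PDF path would return a list of (name, value) pairs, or an empty list when the document has no form. Non-text values such as checkbox states come back in whatever string form pdfminer gives them, so they may need extra cleanup.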

umrashrf commented 10 years ago

No problem. I look forward to that feature. Until then, I am fine using poppler.