bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Branch poppler-rewrite does not extract any text #25

Closed: lpla closed this issue 4 years ago

lpla commented 4 years ago

I tested the Java build included in the poppler-rewrite branch (runnable-jar/PDFExtract.jar) on a machine running Ubuntu 16.04 (as of today, the only OS on which it works, because of #22) with several PDFs I own and some Internet Archive files, and all I get is:

<html>
<head>
<defaultLang abbr="en" />
<languages>
</languages>
</head>
<body>
<div id="page0" class="page">
</div>
</body>
</html>

Is it just me? The master code (based on pdfbox) works.
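
To check a batch of output files for this symptom, something like the following sketch might help. It parses the generated HTML with Python's standard `html.parser` and lists every `<div class="page">` that contains no non-whitespace text. The names `PageTextCounter`, `empty_pages` and `SAMPLE` are hypothetical. The only assumption about the output format is what the dump above shows: one `div` with class `page` per page, with any extracted text somewhere inside it.

```python
from html.parser import HTMLParser

# The empty output quoted above, used here as a test input.
SAMPLE = """<html>
<head>
<defaultLang abbr="en" />
<languages>
</languages>
</head>
<body>
<div id="page0" class="page">
</div>
</body>
</html>"""


class PageTextCounter(HTMLParser):
    """Count non-whitespace characters inside each <div class="page">."""

    def __init__(self):
        super().__init__()
        self.counts = {}    # page id -> number of non-whitespace characters
        self._page = None   # id of the page div currently open, if any
        self._depth = 0     # nesting depth of <div> tags inside that page

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        if self._page is None:
            # Only a div carrying the "page" class opens a new page.
            if "page" in (attrs.get("class") or "").split():
                self._page = attrs.get("id") or f"page{len(self.counts)}"
                self.counts[self._page] = 0
                self._depth = 1
        else:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._page is not None:
            self._depth -= 1
            if self._depth == 0:
                self._page = None

    def handle_data(self, data):
        if self._page is not None:
            # Strip all whitespace before counting.
            self.counts[self._page] += len("".join(data.split()))


def empty_pages(html):
    """Return the ids of page divs that contain no visible text."""
    parser = PageTextCounter()
    parser.feed(html)
    parser.close()
    return [page for page, n in parser.counts.items() if n == 0]
```

Calling `empty_pages(SAMPLE)` on the output above returns `['page0']`. An empty list would mean every page contains at least some text.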

dionwiggins commented 4 years ago

Please provide the PDFs. We have tested about 20K files without issue, so we need to reproduce the problem on the specific file that fails for you. The file could be protected, or there could be a number of other reasons why it returns empty output. With a sample we can diagnose it.

lpla commented 4 years ago

For example, the master code (pdfbox) extracts text from this one, but poppler-rewrite extracts nothing: https://www.dlsi.ua.es//~mlf/docum/forcada16j.pdf

lpla commented 4 years ago

The command I use is:

~/pdf-extract$ java -jar runnable-jar/PDFExtract.jar -I ~/forcada16j.pdf -O test

dionwiggins commented 4 years ago

This is resolved in the latest release above.