bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

pdf-extract timeout option #30

Closed. lpla closed this issue 4 years ago

lpla commented 4 years ago

I found some small PDF documents in which some pages contain (or pdftohtml detects) so many images that pdf-extract slows down dramatically. For these PDFs, the pdftohtml command (from poppler-rewrite) is still running on a server after 2 days:

pdf-37183335900956442200.3957922402592299.pdf
pdf-88372559814386653860.962414730293883.pdf

A workaround for this specific case could be to simply discard any page whose number of images exceeds a threshold, or to add a timeout, which would also cover other kinds of issues with Poppler.
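Both workarounds can be sketched in a few lines. The Python below is only an illustration under stated assumptions, not pdf-extract's actual implementation: the helper names `run_with_timeout` and `pages_under_threshold` are hypothetical, and the per-page image counts are assumed to be available from some earlier step.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run an external command (hypothetical helper).

    Returns True if the command exits with status 0 within the time limit,
    and False if it fails or exceeds the limit.
    """
    try:
        completed = subprocess.run(
            cmd,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return False
    return completed.returncode == 0


def pages_under_threshold(image_counts, max_images):
    """Return the 1-based numbers of pages with at most max_images images.

    image_counts is assumed to hold one image count per page, in page order.
    """
    return [
        page
        for page, count in enumerate(image_counts, start=1)
        if count <= max_images
    ]


if __name__ == "__main__":
    # Stand-in for a slow converter call: a child process that sleeps too long.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, timeout_seconds=1))
    print(pages_under_threshold([3, 5000, 0], max_images=100))
```

In practice the command passed to `run_with_timeout` would be the converter invocation itself, and the threshold would need tuning against real documents; a timeout is the more general of the two, since it does not depend on knowing why a page is slow.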

dionwiggins commented 4 years ago

Thanks. We could also work on removing the images from the PDF before processing as a workaround. Interesting feedback. I will have Mui investigate.

lpla commented 4 years ago

I see you implemented it, but the 30-second default is probably too low. 10 minutes should be a maximum that we would not want to reach when processing a PDF.

dionwiggins commented 4 years ago

Changed the default to 10 minutes. Closing.