Closed by lpla 4 years ago
Thanks. We could also work on removing the images from the PDF before processing as a workaround. Interesting feedback. I will have Mui investigate.
I see you implemented it, but a 30-second default is probably too low. 10 minutes should be the maximum, a limit we don't expect to reach when processing a PDF.
Changed to 10 minutes as default. Closing.
I found some small PDF documents with pages that contain (or that pdftohtml detects as containing) many images, which makes pdf-extract dramatically slow. These PDFs are still running the `pdftohtml` command (from `poppler-rewrite`) on a server after 2 days: pdf-37183335900956442200.3957922402592299.pdf, pdf-88372559814386653860.962414730293883.pdf
A workaround for this specific case could be simply discarding any page whose number of images exceeds a threshold, or adding a timeout, which would also cover other kinds of issues with Poppler.
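To make the timeout idea concrete, here is a minimal Python sketch of wrapping an external command in a hard time limit. It is not how pdf-extract implements it: the helper name `run_with_timeout` and the constant `DEFAULT_TIMEOUT_S` are hypothetical, and the 600-second value just mirrors the 10-minute default mentioned in this thread.

```python
import subprocess

# Assumption: 10 minutes, matching the default discussed in this thread.
DEFAULT_TIMEOUT_S = 600


def run_with_timeout(cmd, timeout_s=DEFAULT_TIMEOUT_S):
    """Run cmd and return its exit code, or None if it exceeded timeout_s.

    Hypothetical helper for illustration only.
    """
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process and waits for it
        # before raising, so nothing is left running here.
        return None
    return completed.returncode
```

A caller could then do something like `run_with_timeout(["pdftohtml", "-xml", "input.pdf", "output"])` and treat a `None` result as "skip this PDF". The exact arguments pdf-extract passes to `pdftohtml` may differ. Note that this only kills the direct child process, so a wrapper script that spawns its own children would need extra handling.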