Open vinayak-mehta opened 5 years ago
@vinayak-mehta Any update on this? I've been testing a bit: I can process a simple one-page PDF in under 1 second, but a two-page PDF takes around 5 seconds.
@oliverbj Can you post that file?
Hi @vinayak-mehta, you can check this file: https://we.tl/t-1giDJuXVnJ It is very slow. Parsing 10 pages took 1340 s.
Hi @vinayak-mehta, my PR to speed up pdfminer layout analysis just got merged into the develop branch. It makes the file I posted on the PR page 20 times faster, but for other PDFs the speedup is not that significant. Hope it helps camelot too once released.
@mikkkee Thanks for the pdfminer PR!
The function that checks for page rotation is the culprit. pdfminer's layout analysis takes a long time for such PDFs. Examples: the RNTB PDFs from un-sdg.
Adding a kwarg that lets the user specify the rotation is a minor optimization that can fix this.
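
A rough sketch of the idea, not actual camelot code: `resolve_rotation`, its `rotation` kwarg, and the `detect` callable are hypothetical names, and the accepted rotation values are an assumption. The point is only that the expensive detection step is skipped whenever the caller already knows the rotation.

```python
from typing import Callable, Optional

# Assumed set of rotation labels; "" stands for an upright page.
VALID_ROTATIONS = ("", "clockwise", "anticlockwise")


def resolve_rotation(detect: Callable[[], str], rotation: Optional[str] = None) -> str:
    """Return the page rotation, running detection only when none is given.

    `detect` stands in for the slow rotation check (hypothetical hook).
    `rotation` is the proposed user-facing kwarg (hypothetical name).
    """
    if rotation is None:
        # No hint from the user: fall back to the expensive detection.
        return detect()
    if rotation not in VALID_ROTATIONS:
        raise ValueError(f"rotation must be one of {VALID_ROTATIONS!r}")
    # User supplied the rotation, so the detection step is never called.
    return rotation
```

With the default `rotation=None` the behavior stays as it is today, so existing calls would not need to change.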