camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

Page splitting is very slow for some PDFs #11

Open vinayak-mehta opened 5 years ago

vinayak-mehta commented 5 years ago

The function that checks for page rotation is the culprit. pdfminer's layout analysis takes a long time for such PDFs. Examples: the RNTB PDFs from un-sdg.

Adding a kwarg that lets the user specify the rotation is a minor optimization that can fix this.
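
A minimal sketch of that idea, not camelot's actual code: `resolve_rotation`, its `rotation` kwarg, and the set of accepted values are hypothetical names chosen for illustration. The expensive detection step (the pdfminer layout analysis) is passed in as a callable and only runs when the user has not supplied a rotation.

```python
from typing import Callable, Optional

# Hypothetical set of accepted values; "" stands for "not rotated".
VALID_ROTATIONS = {"", "clockwise", "anticlockwise"}


def resolve_rotation(detect: Callable[[], str], rotation: Optional[str] = None) -> str:
    """Return the page rotation, skipping detection when the user supplies it.

    `detect` stands in for the slow layout-analysis-based check.
    """
    if rotation is None:
        # No hint from the user: fall back to the slow automatic check.
        return detect()
    if rotation not in VALID_ROTATIONS:
        raise ValueError(f"rotation must be one of {sorted(VALID_ROTATIONS)}")
    # User-supplied value: the detection callable is never invoked.
    return rotation


# Usage: a stand-in detector that records how often it is called.
calls = []


def slow_detect() -> str:
    calls.append(1)
    return "anticlockwise"


explicit = resolve_rotation(slow_detect, rotation="")  # detection skipped
detected = resolve_rotation(slow_detect)               # detection runs once
```

Using `None` as the default keeps the current behaviour for existing callers, while an explicit empty string lets the user say "this page is not rotated" and skip the check.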

oliverbj commented 5 years ago

@vinayak-mehta Any update on this? I've been testing a bit: I can process a simple one-page PDF file in under 1 second, but with a two-page PDF file the time increases to around 5 seconds.

vinayak-mehta commented 5 years ago

@oliverbj Can you post that file?

mikkkee commented 5 years ago

Hi @vinayak-mehta, you can check this file: https://we.tl/t-1giDJuXVnJ. It is very slow: parsing 10 pages took 1340 s.

mikkkee commented 5 years ago

Hi @vinayak-mehta, my PR to speed up pdfminer layout analysis just got merged into the develop branch. It makes processing of the file I posted on the PR page 20 times faster, but for other PDFs the speedup is not that significant. Hope it helps camelot too once released.

vinayak-mehta commented 4 years ago

@mikkkee Thanks for the pdfminer PR!