atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Reduce file reads in camelot.handlers._save_page #303

Closed niazangels closed 5 years ago

niazangels commented 5 years ago

camelot.handlers._save_page is called once for every page passed to camelot.read_pdf. Each time this function is invoked, the source PDF is read from disk, parsed using PdfFileReader, and decrypted. These repeated reads can be avoided, and doing so improves performance significantly.

A good way to avoid this is to accept a list of pages instead of a single page and run the _save_pages function only once. The PdfFileReader object can then be created once, and we can loop over the pages to save each one separately.
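To make the idea concrete, here is a rough sketch of the shape I have in mind. It is not the code from my fork: `save_pages`, `open_reader`, and `write_page` are hypothetical names, and the `page-N.pdf` output naming is an assumption. The reader is injected as a callable so the sketch does not depend on any particular PDF library.

```python
import os


def save_pages(filepath, pages, temp_dir, open_reader, write_page):
    """Hypothetical batch variant of _save_page (a sketch, not camelot's code).

    open_reader(filepath) should read, parse, and (if needed) decrypt the PDF.
    write_page(reader, page_no, out_path) should write one page to out_path.
    """
    # The expensive step happens exactly once, no matter how many pages
    # were requested.
    reader = open_reader(filepath)

    saved = []
    for page_no in pages:
        # Assumed naming scheme: one single-page PDF per requested page.
        out_path = os.path.join(temp_dir, f"page-{page_no}.pdf")
        write_page(reader, page_no, out_path)
        saved.append(out_path)
    return saved


# With the legacy PyPDF2 API (assumed installed), the two callables might be:
#
#   def open_reader(path):
#       reader = PdfFileReader(open(path, "rb"), strict=False)
#       if reader.isEncrypted:
#           reader.decrypt(password)  # `password` supplied by the caller
#       return reader
#
#   def write_page(reader, page_no, out_path):
#       writer = PdfFileWriter()
#       writer.addPage(reader.getPage(page_no - 1))  # pages are 1-indexed here
#       with open(out_path, "wb") as f:
#           writer.write(f)
```

With this layout, a request for 80 pages costs one read and parse of the source file plus 80 small writes, instead of 80 reads and parses.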

I already have this working on a private fork, with one hiccup: for certain files, the PdfFileReader object gets modified after successfully looping over and extracting ~80 pages in some of my sample PDFs. I create a copy of the original object to work around this, but it's still a whole lot faster than the current approach, as it completely avoids the 80+ file reads.

Let me know if this is something you'd like to incorporate, and I'd be happy to raise a pull request.

Cheers, and thanks for all the great work! :smile:

vinayak-mehta commented 5 years ago

@niazangels Thanks for pointing it out! Please raise that pull request :)

niazangels commented 5 years ago

Done! Could you please review https://github.com/socialcopsdev/camelot/pull/311 ?

vinayak-mehta commented 5 years ago

Thanks for the PR, I'll do it this weekend!

vinayak-mehta commented 5 years ago

Closing in favor of https://github.com/camelot-dev/camelot/issues/21.