atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io
Other
3.65k stars 355 forks source link

Camelot speed up #427

Open majd1239 opened 4 years ago

majd1239 commented 4 years ago

While using camelot to extract tables from pdfs. I noticed it's really slow. I profiled the code and turns out that %60 of the bottleneck is from np.isclose here and here as well as multiple other places in core.py:

https://github.com/atlanhq/camelot/blob/cd8ac7979fe3631866fe439f07e9d6aaa5b1e5c6/camelot/core.py#L103

https://github.com/atlanhq/camelot/blob/cd8ac7979fe3631866fe439f07e9d6aaa5b1e5c6/camelot/core.py#L67

The slowdown makes sense since there is a very big overhead with np.isclose if we are dealing with native python floats instead of numpy types.

I switched the method to math.isclose instead and the processing time was reduced to more than half!

I can submit an Pull Request with the changes if the devs agree this is a safe change to make.

Thanks

kamil-kierski commented 4 years ago

I've checked your suggestion and studied through the code and looks like its only relevant for "Stream" flavor, with Lattice - the provided code snippets are not even used ;(

FrancoisHuet commented 4 years ago

I've noticed a similar issue. np.isclose is slow, but one thing that doesn't help is that:

=> so that's O(n^2) calls to isclose.

While refactoring stream and network, I've changed this a bit by doing a binary search (method get_index_closest_point, log(n) complexity), before calling isclose. This makes overall the code faster, and reduces calls to isclose drastically. Code change is here. In my local tests this made the unit test suite faster by ~20% (from 85s to 69s), with the image generation becoming the next bottleneck.

Bottom line: once the Pull Request for the new hybrid network is merged, this might no longer be needed.

sttobia commented 3 years ago

Is there any update on that? I am currently also using stream which takes very long. The proposal of @majd1239 sounds very interesting, also if it just improves the speed for stream.

dss010101 commented 1 year ago

update?