Open majd1239 opened 4 years ago
I've checked your suggestion and studied the code, and it looks like it's only relevant for the "Stream" flavor. With Lattice, the provided code snippets are not even used ;(
I've noticed a similar issue. `np.isclose` is slow, but one thing that doesn't help is that `TextEdges.generate` calls `TextEdges.update` n (number of textlines in the doc) times (code) => so that's O(n^2) calls to `isclose`.
While refactoring `stream` and `network`, I've changed this a bit by doing a binary search (method `get_index_closest_point`, log(n) complexity) before calling `isclose`. This makes the code faster overall, and drastically reduces the number of calls to `isclose`.
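The binary-search idea can be sketched as follows. This is a hypothetical illustration, not the actual code from the linked change: `get_index_closest_point` and `find_close` here are stand-ins, and the sorted-list representation of edge coordinates is an assumption. The point is that `bisect` narrows n candidate comparisons down to a single `isclose` check.

```python
import bisect
import math

def get_index_closest_point(point, sorted_points):
    """Return the index of the value in sorted_points closest to point.

    Assumes sorted_points is non-empty and sorted ascending.
    """
    i = bisect.bisect_left(sorted_points, point)  # O(log n)
    if i == 0:
        return 0
    if i == len(sorted_points):
        return len(sorted_points) - 1
    before, after = sorted_points[i - 1], sorted_points[i]
    # Pick whichever neighbor is nearer to the query point.
    return i if after - point < point - before else i - 1

def find_close(point, sorted_points, atol=2.0):
    """One isclose call against the nearest candidate, instead of n calls."""
    idx = get_index_closest_point(point, sorted_points)
    if math.isclose(point, sorted_points[idx], abs_tol=atol):
        return idx
    return None

edges = [10.0, 55.2, 55.9, 120.4, 300.0]
print(find_close(56.0, edges))   # → 2 (the index of 55.9)
print(find_close(200.0, edges))  # → None (nothing within tolerance)
```

Doing the `isclose` tolerance check only after locating the nearest candidate is what turns the per-textline cost from O(n) into O(log n).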
Code change is here. In my local tests this made the unit test suite ~20% faster (from 85s to 69s), with image generation becoming the next bottleneck.
Bottom line: once the Pull Request for the new hybrid network is merged, this might no longer be needed.
Is there any update on that? I am currently also using `stream`, which takes very long. The proposal from @majd1239 sounds very interesting, even if it only improves the speed for `stream`.
update?
While using camelot to extract tables from PDFs, I noticed it's really slow. I profiled the code and it turns out that 60% of the bottleneck is from `np.isclose`, here and here, as well as multiple other places in core.py:
https://github.com/atlanhq/camelot/blob/cd8ac7979fe3631866fe439f07e9d6aaa5b1e5c6/camelot/core.py#L103
https://github.com/atlanhq/camelot/blob/cd8ac7979fe3631866fe439f07e9d6aaa5b1e5c6/camelot/core.py#L67
The slowdown makes sense, since `np.isclose` has a very big overhead when dealing with native Python floats instead of numpy types. I switched the method to `math.isclose` instead and the processing time was reduced by more than half! I can submit a Pull Request with the changes if the devs agree this is a safe change to make.
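A rough way to see the per-call overhead (a standalone micro-benchmark, not camelot's code): `np.isclose` converts its arguments to arrays and broadcasts them even for two scalars, while `math.isclose` is a thin C-level scalar check. One caution on whether the swap is "safe": the two functions use different tolerance formulas by default (`np.isclose` uses `atol + rtol*|b|`; `math.isclose` uses a symmetric `rel_tol` with `abs_tol=0`), so the call sites need explicit tolerances to preserve behavior.

```python
import math
import timeit

import numpy as np

a, b = 1.000001, 1.000002

# Time 10k scalar comparisons with each function.
t_np = timeit.timeit(lambda: np.isclose(a, b, atol=2), number=10_000)
t_math = timeit.timeit(lambda: math.isclose(a, b, abs_tol=2), number=10_000)

print(f"np.isclose:   {t_np:.4f}s")
print(f"math.isclose: {t_math:.4f}s")

# With an explicit absolute tolerance, both agree on the result.
assert math.isclose(a, b, abs_tol=2) == bool(np.isclose(a, b, atol=2))
```

On typical machines the `math.isclose` loop finishes an order of magnitude faster, which matches the profiling result above.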
Thanks