camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

if (bbox_intersection_area(ba, bb) / bbox_area(ba)) > 0.8: ZeroDivisionError: float division by zero #495

Open arjungandeeva opened 6 months ago

arjungandeeva commented 6 months ago

I'm encountering "ZeroDivisionError: float division by zero" in camelot-py, raised by the overlap check that divides bbox_intersection_area(ba, bb) by bbox_area(ba). It occurs only under certain conditions, most likely when the area of bounding box ba is zero.

cktse commented 5 months ago

I did a quick fix/hack to circumvent the error by skipping over the area check if ba is singular (area is zero):

~/.pyenv/versions/3.11.3/lib/python3.11/site-packages/camelot/utils.py: Line 375:

            if bbox_area(ba) > 0 and bbox_intersect(ba, bb):
                # if the intersection is larger than 80% of ba's size, we keep the longest
                if (bbox_intersection_area(ba, bb) / bbox_area(ba)) > 0.8:
                    if bbox_longer(bb, ba):
                        rest.discard(ba)
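
For illustration, here is a minimal self-contained sketch of the same guard outside camelot. The helper bodies are hypothetical stand-ins that assume boxes are (x0, y0, x1, y1) tuples; they are not camelot's actual implementations, and overlap_ratio is a name introduced here only for the example:

```python
def bbox_area(bb):
    # Hypothetical stand-in: area of an (x0, y0, x1, y1) box, never negative.
    x0, y0, x1, y1 = bb
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def bbox_intersection_area(ba, bb):
    # Hypothetical stand-in: area of the overlap between two boxes.
    x_left = max(ba[0], bb[0])
    y_bottom = max(ba[1], bb[1])
    x_right = min(ba[2], bb[2])
    y_top = min(ba[3], bb[3])
    if x_right <= x_left or y_top <= y_bottom:
        return 0.0
    return (x_right - x_left) * (y_top - y_bottom)


def overlap_ratio(ba, bb):
    # Fraction of ba covered by bb; a degenerate (zero-area) ba yields 0.0
    # instead of raising ZeroDivisionError.
    area = bbox_area(ba)
    if area <= 0:
        return 0.0
    return bbox_intersection_area(ba, bb) / area


# A zero-width box no longer triggers the division error.
degenerate = (0, 0, 0, 10)
other = (0, 0, 5, 5)
print(overlap_ratio(degenerate, other) > 0.8)
```

Treating a zero-area box as "no overlap" has the same effect as the patch above: the 80% comparison is simply skipped for that box.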

bosd commented 1 month ago

Hey!

As discussed in https://github.com/camelot-dev/camelot/issues/343, we are building a maintained fork at pypdf_table_extraction.

Do you want to check that code and open an issue / PR there to include this fix?

cktse commented 1 month ago

I just took another look at the branches. It looks like this has already been fixed as part of "Release camelot-fork 0.20.1", which is already included in your fork.

bosd commented 1 month ago

Thanks for checking 👍

cktse commented 1 month ago

Great to see camelot lives on!

BTW is this fork going to be packaged on pip under a separate name? Think the current package is stale from the main branch.

bosd commented 1 month ago

BTW is this fork going to be packaged on pip under a separate name? Think the current package is stale from the main branch.

Yes, it is published here https://pypi.org/project/pypdf-table-extraction/

We're currently working on a new release by merging the open PRs from this repo and rebranding the package.