atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Bbox_intersection_area with a ZeroDivisionError: float division by zero #471

Open olivierbouman opened 2 years ago

olivierbouman commented 2 years ago

Hello all,

I am trying to extract some tables from a PDF using camelot-py version 0.10.1 with the following setup:

tables = camelot.read_pdf(
    in_file,
    flavor='stream',
    table_areas=[self._camelot_table_area(bbox)],
    pages=str(page_nr),
)

Basically all of the tables are parsed very well, but one table throws the following error:

  File "/home/**********/.local/lib/python3.8/site-packages/camelot/utils.py", line 379, in text_in_bbox
    if (bbox_intersection_area(ba, bb) / bbox_area(ba)) > 0.8:
ZeroDivisionError: float division by zero

Looking at camelot/utils.py, it seems that camelot encountered (or created) a zero-width text line element:

<LTTextLineHorizontal 65.883,392.092,65.883,404.102 'ً\n'>

This element is ba in lines 379-381 of utils.py:

if (bbox_intersection_area(ba, bb) / bbox_area(ba)) > 0.8:
    if bbox_longer(bb, ba):
        rest.discard(ba)

Its x0 and x1 coordinates are equal, so the area of ba is zero and the division raises the ZeroDivisionError.
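To make the arithmetic concrete, here is a minimal sketch using the coordinates printed above. Note that bbox_area below is a stand-in written for illustration, assuming boxes are (x0, y0, x1, y1) tuples and the area is width times height; it is not copied from camelot's utils.py.

```python
def bbox_area(bbox):
    """Area of an (x0, y0, x1, y1) box.

    Hypothetical stand-in for illustration, not camelot's implementation.
    """
    x0, y0, x1, y1 = bbox
    return (x1 - x0) * (y1 - y0)


# Coordinates of the LTTextLineHorizontal reported above: x0 equals x1.
ba = (65.883, 392.092, 65.883, 404.102)

width = ba[2] - ba[0]   # zero, because x0 == x1
height = ba[3] - ba[1]  # non-zero
area = bbox_area(ba)    # zero width times any height is zero

# Any intersection with a zero-area box has zero area too,
# so the ratio becomes 0.0 / 0.0, which Python refuses to compute.
try:
    ratio = 0.0 / area
    raised = False
except ZeroDivisionError:
    raised = True
```

So the box has a height but no width, which is enough to trigger the error on the division.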

Is this by any chance a problem anyone else encountered before? And if so, any possible solutions?

It also seems that this could be caught by checking for a zero-size area. Or was that check left out of the code on purpose? :)
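A zero-area check along those lines might look like the sketch below. The function name safe_overlap_ratio is hypothetical (it does not exist in camelot), and returning 0.0 for a degenerate box is an assumption about the desired behaviour, not something the maintainers have confirmed.

```python
def safe_overlap_ratio(intersection_area, area):
    """Return intersection_area / area, treating a zero-area box as no overlap.

    Hypothetical helper for illustration; not part of camelot.
    """
    if area == 0:
        # Assumption: a box with no area cannot be "mostly covered" by another.
        return 0.0
    return intersection_area / area


degenerate = safe_overlap_ratio(0.0, 0.0)  # zero-area box, no exception
regular = safe_overlap_ratio(9.0, 10.0)    # ordinary box, plain division
```

With this guard the comparison against 0.8 simply evaluates to False for the zero-width element instead of raising.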

Many thanks in advance!

saidakyuz commented 1 year ago

Can you check the properties of the PDF? It could be secured for extraction.

BouregagYoucef commented 1 year ago

Did you solve the problem? If so, could you share the solution, please? I have the same problem.

BouregagYoucef commented 1 year ago

https://camelot-py.readthedocs.io/en/master/user/install-deps.html

Try the method described there.

Fadheler commented 10 months ago

In case someone else is having this error, I fixed it by changing this line in utils.py from:

if (bbox_intersection_area(ba, bb) / bbox_area(ba)) > 0.8:

to:

if bbox_intersection_area(ba, bb) > bbox_area(ba) * 0.8:
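For anyone wondering why this works: multiplying both sides by the area gives the same answer as the division whenever the area is positive (up to floating-point rounding), and it never divides. A minimal sketch, using a hypothetical helper name overlaps_mostly and plain numbers in place of camelot's bbox functions:

```python
def overlaps_mostly(intersection_area, area, threshold=0.8):
    """Division-free form of `intersection_area / area > threshold`.

    Hypothetical helper for illustration; not part of camelot.
    """
    return intersection_area > area * threshold


# (intersection area, area of ba) pairs
cases = [
    (9.0, 10.0),  # 90% covered: above the threshold
    (5.0, 10.0),  # 50% covered: below the threshold
    (0.0, 0.0),   # zero-area box: 0 > 0 is False, and nothing is divided
]
results = [overlaps_mostly(i, a) for i, a in cases]
```

One consequence is that for a zero-area ba the condition is always False, so that element is never discarded by this branch.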

nikkiBot commented 8 months ago

@Fadheler Yep, this does solve the issue, but some parts of the PDF are not recognised (in my case, the top row of the tables came out empty). This probably needs some tweaking of the tolerance parameters row_tol and column_tol.

mahesh-solanke commented 1 month ago

@olivierbouman

In case someone is having this error, I fixed this by changing the line in utils.py from:

if (bbox_intersection_area(ba, bb) / bbox_area(ba)) > 0.8:

to:

if bbox_intersection_area(ba, bb) > bbox_area(ba) * 0.8:

Since this is a change inside the package, we should not edit the installed library code directly. To work around it, we applied the change and built a .whl file that can be installed as a package until the issue is fixed in the package itself.

Please find the attachment for reference: https://drive.google.com/file/d/1COKC7s9uez8neZgrgaOjbUcLy56Ib7QO/view?usp=sharing