camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
ZeroDivisionError in text_in_bbox for some tables #318

Open peterhvoth opened 2 years ago

peterhvoth commented 2 years ago

camelot.read_pdf(fn, pages='all') fails occasionally with a ZeroDivisionError.

I chased the problem down to page 20 on https://www.accessdata.fda.gov/cdrh_docs/pdf6/P060040B.pdf

As near as I can tell, the problem is that the page contains a table that includes a bar graph. There's no problem extracting tables from any other page in that file, and that's the only thing that stood out to me as different about that page.

If that's actually the case, might the bars be getting interpreted as really thick table lines?

Regardless, I was able to fix it locally by changing line 374 of utils.py from

    if bbox_intersect(ba, bb):

to

    if bbox_intersect(ba, bb) and bbox_area(ba):

I'm hoping that will be a clue to what's causing the problem.
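To illustrate why a guard like this could help, here is a minimal, self-contained sketch. It is not camelot's actual code: the `Box` class, the simplified helper bodies, and the two `overlap_ratio_*` functions are hypothetical stand-ins that only borrow the helper names mentioned above. It assumes the failure comes from dividing an intersection area by the area of a box that has zero width or height.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box (not a camelot type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def bbox_area(b: Box) -> float:
    # Width times height; zero for a degenerate (line-like) box.
    return max(0.0, b.x1 - b.x0) * max(0.0, b.y1 - b.y0)


def bbox_intersect(a: Box, b: Box) -> bool:
    # Touching edges count as intersecting, so a zero-area box can intersect.
    return a.x0 <= b.x1 and b.x0 <= a.x1 and a.y0 <= b.y1 and b.y0 <= a.y1


def bbox_intersection_area(a: Box, b: Box) -> float:
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def overlap_ratio_unguarded(a: Box, b: Box) -> float:
    # Assumed shape of the failing logic: divides by the area of `a`.
    if bbox_intersect(a, b):
        return bbox_intersection_area(a, b) / bbox_area(a)
    return 0.0


def overlap_ratio_guarded(a: Box, b: Box) -> float:
    # Same logic with the extra truthiness check on the area of `a`.
    if bbox_intersect(a, b) and bbox_area(a):
        return bbox_intersection_area(a, b) / bbox_area(a)
    return 0.0
```

In this sketch, a zero-width box such as `Box(10, 10, 10, 20)` intersects `Box(5, 5, 15, 25)` but has no area, so the unguarded version raises ZeroDivisionError while the guarded version skips the division. Whether the boxes on page 20 are degenerate in this way is still an assumption.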