atlanhq / camelot

Camelot: PDF Table Extraction for Humans

Division by zero without using table_regions #480

Open cmartinotti opened 2 years ago

cmartinotti commented 2 years ago

Hello, I'm trying to extract tables with default parameters in stream mode. I ran:

tables_cam=camelot.read_pdf(filepath='pdfs_files/fulltext.pdf', pages="10",flavor='stream' )

It returns:

ZeroDivisionError                         Traceback (most recent call last)
/tmp/ipykernel_7338/ in <module>
----> 1 tables_cam=camelot.read_pdf(filepath='pdfs_files/fulltext.pdf',
      2                             pages="9,10",
      3                             flavor='stream',
      4                             edge_tol=500
      5                            )

~/anaconda3/envs/test/lib/python3.8/site-packages/camelot/ in read_pdf(filepath, pages, password, flavor, suppress_stdout, layout_kwargs, **kwargs)
    111         p = PDFHandler(filepath, pages=pages, password=password)
    112         kwargs = remove_extra(kwargs, flavor=flavor)
--> 113         tables = p.parse(
    114             flavor=flavor,
    115             suppress_stdout=suppress_stdout,

~/anaconda3/envs/test/lib/python3.8/site-packages/camelot/ in parse(self, flavor, suppress_stdout, layout_kwargs, **kwargs)
    174             parser = Lattice(**kwargs) if flavor == "lattice" else Stream(**kwargs)
    175             for p in pages:
--> 176                 t = parser.extract_tables(
    177                     p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
    178                 )

~/anaconda3/envs/test/lib/python3.8/site-packages/camelot/parsers/ in extract_tables(self, filename, suppress_stdout, layout_kwargs)
    461             sorted(self.table_bbox.keys(), key=lambda x: x[1], reverse=True)
    462         ):
--> 463             cols, rows = self._generate_columns_and_rows(table_idx, tk)
    464             table = self._generate_table(table_idx, cols, rows)
    465             table._bbox = tk

~/anaconda3/envs/test/lib/python3.8/site-packages/camelot/parsers/ in _generate_columns_and_rows(self, table_idx, tk)
    323         # select elements which lie within table_bbox
    324         t_bbox = {}
--> 325         t_bbox["horizontal"] = text_in_bbox(tk, self.horizontal_text)
    326         t_bbox["vertical"] = text_in_bbox(tk, self.vertical_text)

~/anaconda3/envs/test/lib/python3.8/site-packages/camelot/ in text_in_bbox(bbox, text)
    374             if bbox_intersect(ba, bb):
    375                 # if the intersection is larger than 80% of ba's size, we keep the longest
--> 376                 if (bbox_intersection_area(ba, bb) / bbox_area(ba)) > 0.8:
    377                     if bbox_longer(bb, ba):
    378                         rest.discard(ba)

ZeroDivisionError: float division by zero

I would expect it to report that no tables were found (as it normally does) rather than crash with a division by zero. How do I prevent this? PDF_FILE: fulltext.pdf
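For reference, the last frame of the traceback divides by `bbox_area(ba)`, so that call must have returned 0 for at least one text box. A plausible cause, though not confirmed for this PDF, is a text element with zero width or zero height. The sketch below shows how such a ratio could be guarded. It is not camelot's code: boxes are plain `(x0, y0, x1, y1)` tuples rather than pdfminer layout objects, and `area_of`, `intersection_area_of` and `overlap_ratio` are hypothetical stand-ins named for illustration only.

```python
def area_of(box):
    """Area of an (x0, y0, x1, y1) box; 0.0 for degenerate boxes."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area_of(a, b):
    """Area of the overlap between two (x0, y0, x1, y1) boxes."""
    x_left = max(a[0], b[0])
    y_bottom = max(a[1], b[1])
    x_right = min(a[2], b[2])
    y_top = min(a[3], b[3])
    if x_right <= x_left or y_top <= y_bottom:
        return 0.0  # the boxes do not overlap
    return (x_right - x_left) * (y_top - y_bottom)


def overlap_ratio(a, b):
    """Fraction of box a covered by box b, without dividing by zero."""
    area = area_of(a)
    if area == 0:
        # Assumption: a zero-area box is treated as "no meaningful overlap"
        # instead of raising ZeroDivisionError.
        return 0.0
    return intersection_area_of(a, b) / area


# A zero-width box no longer raises; a normal box gives the usual ratio.
print(overlap_ratio((0, 0, 0, 10), (0, 0, 5, 10)))
print(overlap_ratio((0, 0, 10, 10), (0, 0, 5, 10)))
```

Returning 0.0 for a degenerate box is only one possible choice; skipping such boxes before the comparison would be another.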

PS: If I convert a PDF page into an image, locate the table area in the image, and then pass the corresponding area to camelot to extract the tables, is the conversion from image position to PDF position simply pos_image * size_pdf / size_img?
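On the conversion question, scaling by size_pdf / size_img covers the x axis, but the y axis probably also needs flipping: image coordinates usually have their origin at the top-left, while PDF coordinates have it at the bottom-left. A minimal sketch follows, assuming the image covers the whole page with no cropping, the page is not rotated, and the PDF page origin is at (0, 0). `image_box_to_pdf_area` and `to_table_areas_string` are hypothetical helper names, and the page and image sizes are made-up example values.

```python
def image_box_to_pdf_area(box_px, image_size, pdf_size):
    """Convert a (left, top, right, bottom) pixel box to PDF coordinates.

    Returns (x1, y1, x2, y2) with (x1, y1) the top-left corner and
    (x2, y2) the bottom-right corner, measured from the bottom-left
    of the page.
    """
    left, top, right, bottom = box_px
    img_w, img_h = image_size
    pdf_w, pdf_h = pdf_size
    sx = pdf_w / img_w  # horizontal scale, PDF units per pixel
    sy = pdf_h / img_h  # vertical scale, PDF units per pixel
    x1 = left * sx
    x2 = right * sx
    # Flip the y axis: pixel rows count down from the top of the image,
    # PDF y values count up from the bottom of the page.
    y1 = pdf_h - top * sy
    y2 = pdf_h - bottom * sy
    return x1, y1, x2, y2


def to_table_areas_string(area):
    """Format the area as an "x1,y1,x2,y2" string."""
    return ",".join(f"{v:.1f}" for v in area)


# Example values: a 612 x 792 page rendered to a 1224 x 1584 image.
area = image_box_to_pdf_area((100, 200, 1100, 700), (1224, 1584), (612, 792))
print(to_table_areas_string(area))  # 50.0,692.0,550.0,442.0
```

The resulting string has the "x1,y1,x2,y2" shape (top-left, then bottom-right, in PDF coordinate space) that camelot's `table_areas` argument is documented to accept, so it could be passed as `table_areas=[...]`; whether that avoids the error on this particular file is untested.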

ashleych commented 2 years ago

Hello @cmartinotti, any luck with this? I am stuck on the same issue, but with an Arabic-language PDF.