atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io
Other
3.61k stars, 349 forks

Bug: Hardcoded value of '10' limits number of tables in page #487

Closed ottohirr closed 1 year ago

ottohirr commented 1 year ago

In the function find_contours, located in the file image_processing.py, there are the following two lines:

    # sort in reverse based on contour area and use first 10 contours
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:10]

Only the 10 largest contours by area are kept, so any tables beyond the tenth on a page are dropped.

It may seem reasonable to assume that a page would have fewer than 10 tables, but that is not always the case.

A simple example of a PDF that may contain more than 10 tables on a page is a work schedule in which each small group of scheduled people, say for a given department, has a box drawn around it. More than 10 departments may be listed on a single page.

This should not be a hardcoded numeric value, but a settable parameter.
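As a rough illustration of the idea, here is a minimal sketch of the sort-and-truncate step with the limit exposed as an argument. The names largest_contours and max_contours are hypothetical and are not part of Camelot; the sketch uses plain (width, height) tuples as stand-in contours so it runs without OpenCV. In find_contours, the area function would presumably be cv2.contourArea.

```python
def largest_contours(contours, area, max_contours=10):
    """Sort contours by descending area and keep at most max_contours.

    Hypothetical helper, not Camelot's API. Passing max_contours=None
    disables the limit and keeps every contour.
    """
    if max_contours is not None and max_contours < 0:
        raise ValueError("max_contours must be non-negative or None")
    ordered = sorted(contours, key=area, reverse=True)
    return ordered if max_contours is None else ordered[:max_contours]


def box_area(box):
    """Area of a stand-in contour given as a (width, height) tuple."""
    width, height = box
    return width * height


# Twelve boxes of different sizes, standing in for 12 tables on one page.
boxes = [(i + 1, 2) for i in range(12)]

default_kept = largest_contours(boxes, box_area)                 # current behaviour
all_kept = largest_contours(boxes, box_area, max_contours=None)  # no limit

print(len(default_kept), len(all_kept))
```

Keeping 10 as the default would preserve the existing behaviour for current users, while callers with dense pages could raise the limit or turn it off.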

Regards,

..Otto

ottohirr commented 1 year ago

Filed in active repo at https://github.com/camelot-dev/camelot/issues/319