camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

Lattice "iterations" parameter #363

Open Antonin21 opened 1 year ago

Antonin21 commented 1 year ago

Hi,

I'm using lattice on a PDF that contains a table whose lines don't cross each other, like this (horizontal mask): [image]

A good approach to solve this would be to dilate the mask, then erode it. After dilation the lines cross each other, and erosion then restores the table to its original dimensions.
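To illustrate the dilate-then-erode idea, here is a minimal pure-Python sketch on a tiny binary grid. It is not Camelot's or OpenCV's implementation: the helper names (`_morph`, `dilate`, `erode`), the 3x3 square structuring element, and the toy mask are all assumptions made for the example.

```python
def _morph(mask, combine):
    """Apply a 3x3 square structuring element; pixels outside the image are ignored."""
    h, w = len(mask), len(mask[0])
    return [
        [
            int(combine(
                mask[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ))
            for x in range(w)
        ]
        for y in range(h)
    ]


def dilate(mask):
    # A pixel is set if ANY pixel in its neighbourhood is set.
    return _morph(mask, any)


def erode(mask):
    # A pixel stays set only if ALL pixels in its neighbourhood are set.
    return _morph(mask, all)


# Toy mask (hypothetical): a vertical line and a horizontal line that
# stops one pixel short of it, so the two lines never touch.
H, W = 9, 11
mask = [[0] * W for _ in range(H)]
for y in range(2, 7):
    mask[y][2] = 1          # vertical line at x == 2
for x in range(4, 9):
    mask[4][x] = 1          # horizontal line; the gap is at x == 3

dilated = dilate(mask)      # lines now touch, but are 3 pixels thick
closed = erode(dilated)     # thickness restored, the join is kept

for name, grid in (("original", mask), ("dilated", dilated), ("closed", closed)):
    print(name)
    for row in grid:
        print("".join("#" if v else "." for v in row))
```

In this toy case the dilated mask has thicker and longer lines than the input, while the dilated-then-eroded mask differs from the input only by the single pixel that bridges the gap.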

There is already a documented iterations parameter, which I think might have been added for this kind of issue:

iterations (int, optional (default: 0)) – Number of times for erosion/dilation is applied. For more information, refer OpenCV’s dilate.

If I use iterations=1, I end up with the following mask (horizontal and vertical merged, bottom-right of the table): [image]

It 'works', as now lines do cross each other, but it only dilates the image. As a result, the detected grid contains an additional line at the top and bottom of the table.

I would suggest the following change in image_processing.py to solve this:

 dmask = cv2.dilate(threshold, el, iterations=iterations)
+dmask = cv2.erode(dmask, el, iterations=iterations)

However, this could potentially break some existing software, and I'm not sure why only dilate was added in the first place. Maybe adding a new parameter erode_iterations would be better. What do you think? I can make a PR for this change if requested.