atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Table extraction extracts more rows than actually exist #315

Closed sweco-sekrsv closed 5 years ago

sweco-sekrsv commented 5 years ago

Sometimes table extraction returns more rows than actually exist, for example in the attached file. Visually, the first table at the top contains 4 columns and 3 rows, but Camelot reports 4 columns and 5 rows. Attached: 41-21-40001.pdf

I'm using Camelot 0.7.2 and extracting with this command:

tables = camelot.read_pdf(loadpath, flavor='lattice', line_scale=40)

CartierPierre commented 5 years ago

I guess Camelot is extracting the second table (5 rows, 4 cols).

sweco-sekrsv commented 5 years ago

Thanks for the input, but the second table is found correctly, with 4 rows and 5 columns.

CartierPierre commented 5 years ago

So you extract 2 tables, both with 4 rows and 5 cols? Have you tried using templates?

sweco-sekrsv commented 5 years ago

No :) I extract two tables, that's correct.

Visually, the first table (at the top) has 4 columns and 3 rows. Camelot reports (wrongly) 4 columns and 5 rows.

Visually, the second table (at the bottom) has 5 columns and 4 rows. Camelot reports (correctly) 5 columns and 4 rows.

vinayak-mehta commented 5 years ago

More rows are being detected in the first table due to the text underlines in the bottom two rows intersecting with the table boundary.

sweco-sekrsv commented 5 years ago

Ahh. I see, it might be fixable then.

As a workaround, do you know of any library that can find and strip out the underlined text before running camelot?

vinayak-mehta commented 5 years ago

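You can try reducing the line_scale, which is 15 by default. You can check out the docs for it here. It works both ways: a larger value makes Camelot detect smaller lines (and, if very large, treat text as lines), while a smaller value makes it ignore them.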
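To make the effect of line_scale concrete, here is a rough sketch. It assumes (my reading, not something stated in this thread) that the lattice line detector erodes the rendered page with a kernel whose length is the image extent divided by line_scale, so any segment shorter than that length is dropped. The page width and underline length below are hypothetical numbers, not measurements from the attached PDF.

```python
# Sketch only. Assumption: the kernel used to find horizontal lines is
# (image width // line_scale) pixels long, and shorter segments are discarded.


def min_detectable_length(extent_px: int, line_scale: int) -> int:
    """Shortest line, in pixels, that would survive under the assumption above."""
    return extent_px // line_scale


def is_detected(segment_px: int, extent_px: int, line_scale: int) -> bool:
    """True if a segment of the given length would still be treated as a line."""
    return segment_px >= min_detectable_length(extent_px, line_scale)


PAGE_WIDTH_PX = 2480  # hypothetical: A4 width rendered at 300 dpi
UNDERLINE_PX = 100    # hypothetical underline length, not taken from the PDF

for scale in (40, 14):
    threshold = min_detectable_length(PAGE_WIDTH_PX, scale)
    detected = is_detected(UNDERLINE_PX, PAGE_WIDTH_PX, scale)
    print(f"line_scale={scale}: threshold={threshold}px, underline detected={detected}")
```

With these made-up numbers, line_scale=40 gives a threshold short enough that the underline counts as a table line, while line_scale=14 raises the threshold above it.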

I just tried with a value of 14 and was able to discard the underlines in the first table.

$ camelot lattice -scale 14 -plot grid underlined-text.pdf
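For anyone calling Camelot from Python rather than the CLI, the same setting can be passed to read_pdf. This is a minimal sketch; the helper name table_shapes is made up for illustration, and the file name in the usage comment is just the one from the command above.

```python
def table_shapes(pdf_path, line_scale=14):
    """Return the (rows, columns) shape of each lattice table found in the PDF.

    table_shapes is a hypothetical helper, not part of Camelot.
    """
    # Imported here so the helper can be defined even where Camelot is absent.
    import camelot

    tables = camelot.read_pdf(pdf_path, flavor="lattice", line_scale=line_scale)
    return [table.shape for table in tables]


# Usage (requires Camelot and the PDF to be present):
# print(table_shapes("underlined-text.pdf"))
# To see the detected grid, as "-plot grid" does on the CLI:
# camelot.plot(tables[0], kind="grid").show()
```

Comparing the shapes returned for line_scale=40 and line_scale=14 should show whether the extra rows in the first table go away.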