Closed: sweco-sekrsv closed this issue 5 years ago
I guess Camelot is extracting the second table (5 rows, 4 cols).
Thanks for the input, but the second table is correctly found with 4 rows and 5 columns.
So you extract 2 tables, both with 4 rows and 5 cols? Have you tried using templates?
No :) I extract two tables, that's correct.
Visually, the first table at the top has 4 columns and 3 rows; Camelot reports (wrongly) 4 columns and 5 rows.
Visually, the second table at the bottom has 5 columns and 4 rows; Camelot reports (correctly) 5 columns and 4 rows.
More rows are being detected in the first table due to the text underlines in the bottom two rows intersecting with the table boundary.
Ahh. I see, it might be fixable then.
As a workaround, do you know of any library that can find and strip out the underlined text before running camelot?
You can try reducing line_scale, which is 15 by default; it is described in the docs. It works both ways.
I just tried with a value of 14 and was able to discard the underlines in the first table.
$ camelot lattice -scale 14 -plot grid underlined-text.pdf
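To see how `line_scale` affects the row count, the same check can be scripted. This is only a rough sketch: `table_shapes`, `shape_mismatches`, and `report` are hypothetical helper names, not part of Camelot, and it assumes the tables come back in top-to-bottom order, which may not hold for every PDF. In Camelot, a larger `line_scale` lets shorter line segments (such as text underlines) be detected as table lines.

```python
def table_shapes(pdf_path, line_scale=15, pages="1"):
    """Return the (rows, cols) shape of each lattice table Camelot finds."""
    import camelot  # imported lazily so shape_mismatches works without Camelot

    tables = camelot.read_pdf(
        pdf_path, pages=pages, flavor="lattice", line_scale=line_scale
    )
    return [table.shape for table in tables]


def shape_mismatches(reported, expected):
    """List (index, reported_shape, expected_shape) for tables that differ."""
    return [
        (i, tuple(got), tuple(want))
        for i, (got, want) in enumerate(zip(reported, expected))
        if tuple(got) != tuple(want)
    ]


def report(pdf_path, expected, scales=(40, 15, 14)):
    """Print the reported shapes and any mismatches for each line_scale."""
    for scale in scales:
        reported = table_shapes(pdf_path, line_scale=scale)
        print(scale, reported, shape_mismatches(reported, expected))
```

For the file in this thread, the shapes read off the page by eye would be passed as `report("underlined-text.pdf", [(3, 4), (4, 5)])`, with each shape given as (rows, cols).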
Sometimes table extraction returns more rows than actually exist, for example in the attached file. Visually, the first table at the top contains 4 columns and 3 rows, but Camelot reports 4 columns and 5 rows. 41-21-40001.pdf
Using Camelot 0.7.2, I'm extracting with this command:
tables = camelot.read_pdf(loadpath, flavor='lattice', line_scale=40)