atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

camelot transforms multiple lines into separate rows #266

Closed asmiy closed 5 years ago

asmiy commented 5 years ago

When I try to extract a table from a PDF, camelot splits rows that contain multiple lines of text into multiple rows. The PDF file: 8.pdf

I'm using the stream flavor, with table_areas to locate the table.

vinayak-mehta commented 5 years ago

@asmiy Yes, this is the expected behavior for the Stream flavor. It doesn't take into account any lines that may separate table rows, and forms rows based on how the text is organized to form a table, leading to different text rows being treated as different table rows.

asmiy commented 5 years ago

@vinayak-mehta is there an option to make it take the table lines into account?

vinayak-mehta commented 5 years ago

Currently, there isn't an option which takes into account only horizontal lines with stream. You can add a post-extraction step where you merge two rows together if the second row contains a string only in the first column, as seen here.

Please close this if your issue was solved.

boranaf commented 5 years ago

@vinayak-mehta could you please write some sample code for the "post-extraction step" you mentioned? Many thanks.
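
The post-extraction step described above could be sketched roughly as follows. This is only a minimal illustration, not code from the camelot maintainers: the function name merge_continuation_rows and the sep parameter are made up for this example, and it assumes the table has already been converted to a plain list of rows (lists of cell values). A row is treated as a continuation when only its first column contains text, and its text is then appended to the first column of the previous row.

```python
def merge_continuation_rows(rows, sep=" "):
    """Merge rows that only have text in the first column into the row above.

    `rows` is a list of lists of cell values. This helper is hypothetical
    and is not part of camelot.
    """
    merged = []
    for row in rows:
        # Normalise every cell to a stripped string so that blank cells
        # compare equal to "".
        cells = [str(cell).strip() for cell in row]

        # A continuation row has text in the first column only. Tables
        # with a single column are left alone, since every row would
        # otherwise look like a continuation.
        is_continuation = (
            len(merged) > 0
            and len(cells) > 1
            and cells[0] != ""
            and all(cell == "" for cell in cells[1:])
        )

        if is_continuation:
            # Append the text to the first column of the previous row.
            merged[-1][0] = (merged[-1][0] + sep + cells[0]).strip()
        else:
            merged.append(cells)
    return merged


# Made-up sample data, standing in for an extracted table.
sample_rows = [
    ["Item", "Qty", "Price"],
    ["Long product name", "2", "10.00"],
    ["continued here", "", ""],
    ["Other", "1", "5.00"],
]
cleaned = merge_continuation_rows(sample_rows)
```

With camelot, the extracted table is exposed as a pandas DataFrame on each table's df attribute, so the input rows could be obtained with something like tables[0].df.values.tolist() and the result wrapped back into a DataFrame afterwards. Note that the rule is only a heuristic: a genuine table row that happens to have a value in the first column only would also be merged into the row above.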