atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Superposition from text in different columns #273

Closed jaimeorrego closed 5 years ago

jaimeorrego commented 5 years ago

Thank you for the great package. I have a small issue with a PDF that is based on a spreadsheet. One of the columns holds an address and the next column holds the district. The issue occurs when the text in one column is longer than the column width (columns 5 and 6 in the last row).

In the PDF you cannot see the full address, because the end of it is hidden behind the district column, so you cannot read the text. However, the text exists and is retrieved by Camelot.

The problem is that in those cases Camelot does not separate the text of each column. When I set the column separators, Camelot treats the district name as part of the address string, puts it there, and leaves the district column empty for that row. When I set split_text to true, Camelot returns the district name plus the extra words of the address that did not fit in the address column. So my question is: is there a way to separate the content of the two columns when the overflowing text of one column lies under the text of the next column?

anakin87 commented 5 years ago

Please provide the original PDF and your output...

jaimeorrego commented 5 years ago

Here is the PDF: test2.pdf

Here is the result with the column separators, without split text:

camelot --format csv --output test2.csv stream -C 85,140,181,399,458 test2.pdf
2019-02-07T14:37:38 - INFO - Processing page-1
Found 1 tables

test2-page-1-table-1.xlsx

And here is the result when using split text:

camelot --format csv --output test2-1.csv -split stream -C 85,140,181,399,458 test2.pdf
2019-02-07T14:39:41 - INFO - Processing page-1
Found 1 tables

test2-1-page-1-table-1.xlsx

Thanks!
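For readers who prefer the Python API, the two command-line calls should correspond roughly to the sketch below. The helper names `stream_kwargs` and `extract_first_table` are hypothetical and only there for illustration. The sketch assumes `camelot.read_pdf` accepts `flavor`, `columns` (a list holding one comma-separated string per table) and `split_text`, as described in the Camelot documentation.

```python
def stream_kwargs(separators, split_text=False):
    """Build read_pdf keyword arguments mirroring `stream -C ...` (hypothetical helper)."""
    return {
        "flavor": "stream",
        # Camelot expects one comma-separated string of x coordinates per table.
        "columns": [",".join(str(x) for x in separators)],
        "split_text": split_text,
    }


def extract_first_table(path, separators, split_text=False):
    """Return the first table found on page 1 as a pandas DataFrame."""
    import camelot  # imported lazily so the helper above works without Camelot installed

    tables = camelot.read_pdf(path, pages="1", **stream_kwargs(separators, split_text))
    return tables[0].df


if __name__ == "__main__":
    seps = [85, 140, 181, 399, 458]
    print(extract_first_table("test2.pdf", seps))                   # without split text
    print(extract_first_table("test2.pdf", seps, split_text=True))  # with split text
```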

vinayak-mehta commented 5 years ago

Since the overlapping strings share the same x coordinates, they'll be put into the same column. You can do some cleaning afterwards and separate the addresses from the districts (assuming the districts form a finite set).
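A minimal sketch of that cleaning step, in plain Python, might look like the following. It assumes the district name ends up at the end of the merged address cell and that the full list of districts is known in advance. The function names, the column indexes, and the example district names are all hypothetical and not taken from the PDF in this issue.

```python
def split_address_district(cell, districts):
    """Split a merged "address + district" cell into (address, district).

    Assumes the district, if present, is the trailing part of the cell.
    Returns (cleaned cell, "") when no known district is found.
    """
    text = " ".join(cell.split())  # collapse runs of whitespace and newlines
    # Try longer names first so a short name never shadows a longer one.
    for name in sorted(districts, key=len, reverse=True):
        if name and text[-len(name):].lower() == name.lower():
            address = text[: len(text) - len(name)].rstrip(" ,")
            return address, name
    return text, ""


def fix_row(row, address_idx, district_idx, districts):
    """Repair one table row only when its district cell came back empty."""
    if row[district_idx].strip():
        return list(row)
    address, district = split_address_district(row[address_idx], districts)
    fixed = list(row)
    fixed[address_idx] = address
    fixed[district_idx] = district
    return fixed


if __name__ == "__main__":
    known = ["North Ward", "Ward"]  # hypothetical district names
    row = ["7", "123 Long Example Avenue Apt 4 North Ward", ""]
    print(fix_row(row, address_idx=1, district_idx=2, districts=known))
```

If the rows come from a Camelot table, they can be obtained with `table.df.values.tolist()`. Rows whose district cell is already filled are returned unchanged.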

jaimeorrego commented 5 years ago

Okay, thank you. I did that and it works fine; I was just checking whether this was the expected behavior.