atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io
Other
3.64k stars 354 forks

boundary of cells are not detected when cells are tightl #274

Closed jps-ob closed 5 years ago

jps-ob commented 5 years ago

dl_li_zag_e-geld-institute.pdf

The first 2 rows of the first table look like this:

| Instituts-ID | ........ | Strasse | Erlaubnis erteilt am |
|---|---|---|---|
| 119636 | ........ | Esprit-Allee 1 | 30.04.2011 (§ 36 Abs. 1 ZAG) |
| 123367 | ........ | Daniel-Goldbach-Straße 17 - 19 | 24.10.2012 (§ 8a Abs. 1 ZAG) |

In the second row, the 'Strasse' and 'Erlaubnis erteilt am' cell values are merged into 'Strasse', leaving 'Erlaubnis erteilt am' empty.

Is this a cell boundary detection issue?

Parsed data:

tables[0].data
Out[28]:
[
 ['Instituts-ID', 'Name', 'PLZ', 'Ort', 'Strasse', 'Erlaubnis erteilt am', 'Erlaubnis Ende \nam', 'Umfang Erlaubnis'],
 ['119636', 'Esprit Card Services GmbH', '40882', 'Ratingen', 'Esprit-Allee 1', '30.04.2011 (§ 36 Abs. 1 ZAG)', '', 'E-Geld-Institut'],
 ['123367', 'Ingenico Payment Services GmbH', '40880', 'Ratingen', 'Daniel-Goldbach-Straße 17 - 19 24.10.2012 (§ 8a Abs. 1 ZAG)', '', '', 'E-Geld-Institut'],
 ['125314', 'PayCenter GmbH', '85354', 'Freising, Oberbay', 'Max-Lehner-Straße 1a', '06.07.2012 (§ 8a Abs. 1 ZAG)', '', 'E-Geld-Institut'],
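
To check how many rows are affected, the parsed output can be scanned for the symptom described above: an empty 'Erlaubnis erteilt am' cell next to a 'Strasse' cell that contains a date. The following is only a rough sketch. The helper name `find_merged_rows` and the date pattern are assumptions made for illustration, not part of camelot, and the rows are the ones pasted above.

```python
import re

# Header and first rows, copied from the parsed output above.
HEADER = ['Instituts-ID', 'Name', 'PLZ', 'Ort', 'Strasse',
          'Erlaubnis erteilt am', 'Erlaubnis Ende \nam', 'Umfang Erlaubnis']
ROWS = [
    ['119636', 'Esprit Card Services GmbH', '40882', 'Ratingen',
     'Esprit-Allee 1', '30.04.2011 (§ 36 Abs. 1 ZAG)', '', 'E-Geld-Institut'],
    ['123367', 'Ingenico Payment Services GmbH', '40880', 'Ratingen',
     'Daniel-Goldbach-Straße 17 - 19 24.10.2012 (§ 8a Abs. 1 ZAG)', '', '',
     'E-Geld-Institut'],
    ['125314', 'PayCenter GmbH', '85354', 'Freising, Oberbay',
     'Max-Lehner-Straße 1a', '06.07.2012 (§ 8a Abs. 1 ZAG)', '', 'E-Geld-Institut'],
]

# Assumption: a licence date in this document always looks like DD.MM.YYYY.
DATE = re.compile(r'\b\d{2}\.\d{2}\.\d{4}\b')


def find_merged_rows(header, rows, left='Strasse', right='Erlaubnis erteilt am'):
    """Return indices of rows whose `right` cell is empty while `left` holds a date."""
    left_idx, right_idx = header.index(left), header.index(right)
    return [
        i for i, row in enumerate(rows)
        if not row[right_idx].strip() and DATE.search(row[left_idx])
    ]


print(find_merged_rows(HEADER, ROWS))
```

On the rows shown here only the '123367' row is flagged, which matches the merged cell described above.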

jps-ob commented 5 years ago

I tried line_scale values up to 75:

tables = camelot.read_pdf(self.get_full_path(pdf_file_name),
                                  multiple_tables=True,
                                  line_scale=75,
                                  pages='all')
jps-ob commented 5 years ago

I can fudge it into working by telling the reader to split the text across cells (split_text=True); however, this won't work if the document contains merged cells.

tables = camelot.read_pdf(self.get_full_path(pdf_file_name),
                          multiple_tables=True,
                          split_text=True,  # <<<< this
                          line_scale=75,
                          pages='all')
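
If split_text cannot be used because of merged cells, one possible fallback is to repair the affected rows after parsing. This is a minimal sketch, not a camelot feature: `split_swallowed_date` is a hypothetical helper, and it assumes the swallowed value always starts with a DD.MM.YYYY date, which holds for the rows shown in this issue but may not hold elsewhere.

```python
import re

# Assumption: the value that leaked into the left cell starts with DD.MM.YYYY.
# The pattern matches the whitespace directly in front of such a date.
DATE_START = re.compile(r'\s+(?=\d{2}\.\d{2}\.\d{4}\b)')


def split_swallowed_date(row, left, right):
    """Return a copy of `row` with a date-led tail moved from row[left] to row[right].

    The row is returned unchanged (as a copy) when row[right] already has a
    value or when row[left] contains no date to split on.
    """
    fixed = list(row)
    if fixed[right].strip():
        return fixed
    parts = DATE_START.split(fixed[left], maxsplit=1)
    if len(parts) == 2:
        fixed[left], fixed[right] = parts
    return fixed


row = ['123367', 'Ingenico Payment Services GmbH', '40880', 'Ratingen',
       'Daniel-Goldbach-Straße 17 - 19 24.10.2012 (§ 8a Abs. 1 ZAG)', '', '',
       'E-Geld-Institut']
print(split_swallowed_date(row, left=4, right=5))
```

Because the split is keyed on the date pattern and not on cell geometry, it does not depend on where camelot placed the boundary, but it only covers this one pair of columns.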
vinayak-mehta commented 5 years ago

In the second row, the 'Strasse' and 'Erlaubnis erteilt am' cell values are merged into 'Strasse', leaving 'Erlaubnis erteilt am' empty.

You can try visual debugging (kind='grid') from the advanced section of the documentation to see the detected table and tweak line_scale to get to your desired structure.

I don't think multiple_tables is a valid keyword argument.
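
As a rough sketch of the suggestion above, the call could be reduced to arguments that the Lattice parser is documented to accept (line_scale, split_text), with the detected grid plotted for each line_scale value tried. The helper names `lattice_kwargs` and `show_grid` are made up for this example, and the plotting call assumes a camelot version that provides `camelot.plot` along with matplotlib.

```python
def lattice_kwargs(line_scale=15, split_text=False):
    """Build keyword arguments for camelot.read_pdf, without multiple_tables."""
    return {
        "flavor": "lattice",
        "pages": "all",
        "line_scale": line_scale,
        "split_text": split_text,
    }


def show_grid(pdf_path, line_scale=15):
    """Parse the PDF and plot the grid detected for the first table."""
    # Imported lazily so that lattice_kwargs works without camelot installed.
    import camelot
    import matplotlib.pyplot as plt

    tables = camelot.read_pdf(pdf_path, **lattice_kwargs(line_scale))
    camelot.plot(tables[0], kind="grid")
    plt.show()
    return tables


# Example usage (assumes the attached PDF is in the working directory):
# for scale in (15, 40, 75):
#     show_grid("dl_li_zag_e-geld-institute.pdf", line_scale=scale)
```

Comparing the plots for a few line_scale values should show whether the vertical line between 'Strasse' and 'Erlaubnis erteilt am' is detected in the affected row.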