jps-ob closed this issue 5 years ago
I tried line_scale values up to 75:
tables = camelot.read_pdf(self.get_full_path(pdf_file_name),
                          multiple_tables=True,
                          line_scale=75,
                          pages='all')
I can fudge it into working by instructing the reader to split the cells; however, this won't work if there are merged cells in the document:
tables = camelot.read_pdf(self.get_full_path(pdf_file_name),
                          multiple_tables=True,
                          split_text=True,  # <<<< this
                          line_scale=75,
                          pages='all')
You can try visual debugging (kind='grid') from the advanced section of the documentation to see the detected table, and tweak line_scale to get to your desired structure.
I don't think multiple_tables is a valid keyword argument.
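For anyone following along, the visual debugging suggestion could look roughly like the sketch below. It assumes a Camelot version that exposes `camelot.plot` and has matplotlib installed; the helper name `show_detected_grid` and the default `line_scale=40` are made up for illustration, and `multiple_tables` is left out.

```python
def show_detected_grid(pdf_path, line_scale=40, pages="1"):
    """Return a matplotlib figure of the cell grid Camelot detects.

    Hypothetical helper, not part of Camelot itself.
    """
    # Imported lazily so the helper can be defined even where Camelot is absent.
    import camelot

    # line_scale only applies to the lattice flavor (Camelot's default).
    tables = camelot.read_pdf(
        pdf_path,
        flavor="lattice",
        line_scale=line_scale,
        pages=pages,
    )
    # kind="grid" draws the detected cells for the given table.
    return camelot.plot(tables[0], kind="grid")
```

Calling `show_detected_grid("dl_li_zag_e-geld-institute.pdf").show()` with a few different `line_scale` values should make it visible whether the vertical line between 'Strasse' and 'Erlaubnis erteilt am' is detected in every row.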
dl_li_zag_e-geld-institute.pdf
The first 2 rows of the first table look like this:
| Instituts-ID |........| Strasse | Erlaubnis erteilt am |
|---|---|---|---|
| 119636 |........| Esprit-Allee 1 | 30.04.2011 (§ 36 Abs. 1 ZAG) |
| 123367 |........| Daniel-Goldbach-Straße 17 - 19 | 24.10.2012 (§ 8a Abs. 1 ZAG) |
In the second row, the 'Strasse' and 'Erlaubnis erteilt am' cell values are merged into 'Strasse', leaving 'Erlaubnis erteilt am' empty.
Looks like a cell boundary detection issue?
Parsed data:
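To see how many rows are affected, a small post-processing check can flag them. This is only a sketch: `find_merged_rows`, the column indices, and the date pattern are assumptions based on the sample rows in this issue, not anything Camelot provides.

```python
import re

# A dd.mm.yyyy date that follows other text in the same cell
# (an assumption about what a merged cell looks like here).
TRAILING_DATE = re.compile(r"\S\s+\d{2}\.\d{2}\.\d{4}\b")


def find_merged_rows(rows, street_col=4, date_col=5):
    """Return indices of rows whose date column is empty while the
    street column contains a date after other text."""
    suspects = []
    for index, row in enumerate(rows):
        if row[date_col] == "" and TRAILING_DATE.search(row[street_col]):
            suspects.append(index)
    return suspects


# Sample rows copied from the parsed output in this issue.
rows = [
    ['Instituts-ID', 'Name', 'PLZ', 'Ort', 'Strasse',
     'Erlaubnis erteilt am', 'Erlaubnis Ende \nam', 'Umfang Erlaubnis'],
    ['119636', 'Esprit Card Services GmbH', '40882', 'Ratingen',
     'Esprit-Allee 1', '30.04.2011 (§ 36 Abs. 1 ZAG)', '', 'E-Geld-Institut'],
    ['123367', 'Ingenico Payment Services GmbH', '40880', 'Ratingen',
     'Daniel-Goldbach-Straße 17 - 19 24.10.2012 (§ 8a Abs. 1 ZAG)', '', '',
     'E-Geld-Institut'],
    ['125314', 'PayCenter GmbH', '85354', 'Freising, Oberbay',
     'Max-Lehner-Straße 1a', '06.07.2012 (§ 8a Abs. 1 ZAG)', '',
     'E-Geld-Institut'],
]

print(find_merged_rows(rows))  # [2]
```

On real output, `rows` would be `tables[0].data`; the check only reports suspicious rows and does not try to repair them.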
`tables[0].data Out[28]: [
['Instituts-ID', 'Name', 'PLZ', 'Ort', 'Strasse', 'Erlaubnis erteilt am', 'Erlaubnis Ende \nam', 'Umfang Erlaubnis'],
['119636', 'Esprit Card Services GmbH', '40882', 'Ratingen', 'Esprit-Allee 1', '30.04.2011 (§ 36 Abs. 1 ZAG)', '', 'E-Geld-Institut'],
['123367', 'Ingenico Payment Services GmbH', '40880', 'Ratingen', 'Daniel-Goldbach-Straße 17 - 19 24.10.2012 (§ 8a Abs. 1 ZAG)', '', '', 'E-Geld-Institut'],
['125314', 'PayCenter GmbH', '85354', 'Freising, Oberbay', 'Max-Lehner-Straße 1a', '06.07.2012 (§ 8a Abs. 1 ZAG)', '', 'E-Geld-Institut'],