camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

Index out of range in Lattice Parser #98

Open bikashgupta11 opened 4 years ago

bikashgupta11 commented 4 years ago

For one of my documents, I get an "index out of range" error. Please suggest how to handle it. A few rows may fail to parse, but can the partially extracted data still be saved to a DataFrame?

Config for parser

from camelot.parsers import Lattice
parser = Lattice(line_scale=30, split_text=True)

Sample Input Table

[image: tempsnip]

Error

File "/home/ubuntu/.local/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 412, in extract_tables
    table = self._generate_table(table_idx, cols, rows, v_s=v_s, h_s=h_s)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 355, in _generate_table
    table, indices, shift_text=self.shift_text
  File "/home/ubuntu/.local/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 160, in _reduce_index
    if t.cells[r_idx][c_idx].hspan:
IndexError: list index out of range
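
On the question of keeping partial data: one possible workaround, sketched below under stated assumptions, is to extract page by page and catch the `IndexError` per page, so tables from pages that parse cleanly are kept. The helper names `extract_keeping_partial` and `camelot_lattice_page` are hypothetical and not part of camelot. The sketch assumes the public `camelot.read_pdf` entry point with the same `line_scale` and `split_text` settings as the config above. It recovers data at page granularity only: if one table on a page fails, that whole page is skipped.

```python
from typing import Callable, Iterable, List, Tuple


def extract_keeping_partial(
    extract_page: Callable[[int], list],
    pages: Iterable[int],
) -> Tuple[list, List[Tuple[int, str]]]:
    """Run extract_page on each page, keeping results from pages that succeed.

    Returns (tables, failures), where failures holds (page, error message)
    for every page that raised IndexError.
    """
    tables: list = []
    failures: List[Tuple[int, str]] = []
    for page in pages:
        try:
            tables.extend(extract_page(page))
        except IndexError as exc:
            # Record the failure and move on instead of losing earlier pages.
            failures.append((page, str(exc)))
    return tables, failures


def camelot_lattice_page(path: str) -> Callable[[int], list]:
    """Build a per-page extractor around camelot.read_pdf (hypothetical helper)."""

    def extract_page(page: int) -> list:
        import camelot  # imported lazily; requires camelot-py to be installed

        found = camelot.read_pdf(
            path,
            pages=str(page),
            flavor="lattice",
            line_scale=30,
            split_text=True,
        )
        # Each extracted table exposes its data as a pandas DataFrame via .df
        return [table.df for table in found]

    return extract_page
```

A call might look like `tables, failures = extract_keeping_partial(camelot_lattice_page("input.pdf"), range(1, 11))`, where the file name and page range are placeholders. The `failures` list then shows which pages need a closer look.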

bikashgupta11 commented 4 years ago

Please find a fix for this issue in the pull request below:

https://github.com/camelot-dev/camelot/pull/99