Closed heixincai closed 4 years ago
When I use Excalibur, I find that Autodetect Tables picks up an extra small table (see the picture).
If I remove the small table, Camelot can extract the tables. Maybe this is the same issue; this is all I know. Thanks~
Pretty sure I have the same problem.
Running with flavor='stream', columns=['62,105,185,252']
and get
File "/home/user/.local/lib/python3.7/site-packages/camelot/io.py", line 117, in read_pdf
**kwargs
File "/home/user/.local/lib/python3.7/site-packages/camelot/handlers.py", line 172, in parse
p, suppress_stdout=suppress_stdout, layout_kwargs=layout_kwargs
File "/home/user/.local/lib/python3.7/site-packages/camelot/parsers/stream.py", line 458, in extract_tables
cols, rows = self._generate_columns_and_rows(table_idx, tk)
File "/home/user/.local/lib/python3.7/site-packages/camelot/parsers/stream.py", line 336, in _generate_columns_and_rows
if self.columns is not None and self.columns[table_idx] != "":
IndexError: list index out of range
Any way to just ignore tables that don't fall in the provided columns and move on? I get this on a lot of PDFs but the PDFs all have slightly different stuff surrounding the table I'm interested in. So I would just like to move on without failing and later filter the tables that fit my scheme.
PS: I already limited to the right pages but can't / don't know how to give the stream parser a concrete starting and ending point.
Somewhat dirty workaround that works for my case:
cols = ['62,105,185,252']
cols *= 128 # <-- workaround: just make sure to have enough of the same col set for all tables that will be discovered. e.g. ['62,105,185,252', '62,105,185,252', .....]
camelot.read_pdf(pdf_file, flavor='stream', columns=cols)
@heixincai let me know if it helps you too :)
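A slightly more defensive variant of the workaround above, as a rough sketch only. It assumes the `IndexError` comes from the stream parser finding more tables than there are entries in `columns`, which is what the traceback suggests. `repeat_columns` and `read_with_padded_columns` are hypothetical helper names, not part of camelot; `read_pdf` is passed in so you can hand it `camelot.read_pdf`.

```python
def repeat_columns(column_spec, expected_tables):
    """Return one copy of column_spec per table the parser might detect."""
    if expected_tables < 1:
        raise ValueError("expected_tables must be at least 1")
    return [column_spec] * expected_tables


def read_with_padded_columns(read_pdf, path, column_spec, start=1, limit=128, **kwargs):
    """Retry read_pdf with a longer columns list until no IndexError is raised.

    read_pdf is injected (for example camelot.read_pdf), so this helper
    itself has no dependency on camelot. The list length doubles on each
    retry and is capped at `limit`.
    """
    count = start
    while True:
        try:
            return read_pdf(
                path,
                flavor="stream",
                columns=repeat_columns(column_spec, count),
                **kwargs,
            )
        except IndexError:
            # Assumption: the error means "more tables than column entries".
            # Any other IndexError raised inside read_pdf is retried too,
            # until the limit is reached and the error is re-raised.
            if count >= limit:
                raise
            count = min(count * 2, limit)
```

Usage would look like `read_with_padded_columns(camelot.read_pdf, pdf_file, '62,105,185,252')`. Each retry parses the PDF again, so the plain `cols *= 128` approach is probably cheaper when the PDFs are large.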
@heixincai @pachacamac If you know the approximate location of the table in your PDF (assuming the table always lies in this general area in all PDFs that you have), you can specify table_regions to make camelot look for tables in only these regions.
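For reference, a minimal sketch of what passing `table_regions` could look like. The coordinates in the usage line are made up, and `region` and `extract_from_region` are hypothetical helper names. Camelot expects each region as an `"x1,y1,x2,y2"` string in PDF coordinate space, with (x1, y1) the left-top corner and (x2, y2) the right-bottom corner, so y1 is larger than y2.

```python
def region(x1, y1, x2, y2):
    """Format a region as the 'x1,y1,x2,y2' string used by table_regions.

    (x1, y1) is the left-top corner and (x2, y2) the right-bottom corner
    in PDF coordinates, where the origin is at the bottom-left of the page.
    """
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected x1 < x2 and y1 > y2 (left-top, right-bottom)")
    return f"{x1},{y1},{x2},{y2}"


def extract_from_region(path, x1, y1, x2, y2, pages="1"):
    """Run the stream parser on a single region of the given pages."""
    import camelot  # imported lazily so region() can be used without camelot

    return camelot.read_pdf(
        path,
        flavor="stream",
        pages=pages,
        table_regions=[region(x1, y1, x2, y2)],
    )
```

For example, `extract_from_region(pdf_file, 50, 700, 550, 100)` would restrict detection to that box on page 1. This only helps when the table sits in roughly the same place in every PDF.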
@vinayak-mehta for me the problem is that I have PDFs where I'm interested in tables by structure (same columns etc) but different height, y-position, etc. on multiple pages (unknown number of pages)
Perhaps, we can put in another filter to weed out tables which do not have a certain width/height as a parameter inside the library.
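Until something like that exists in the library, a similar filter can be applied after extraction. This is a sketch under assumptions: it filters on the number of columns and rows in each table's DataFrame rather than on physical width/height, and `filter_by_shape` is a hypothetical name. It relies only on each table exposing a `.df` attribute, as camelot's tables do.

```python
def filter_by_shape(tables, expected_columns, min_rows=1):
    """Keep only tables whose DataFrame has the expected number of columns.

    tables: any iterable of objects with a .df attribute (a DataFrame).
    expected_columns: column count of the table layout you care about.
    min_rows: drop tables with fewer rows than this.
    """
    kept = []
    for table in tables:
        rows, cols = table.df.shape
        if cols == expected_columns and rows >= min_rows:
            kept.append(table)
    return kept
```

With four separators such as `'62,105,185,252'`, the tables of interest would presumably have five columns, so `filter_by_shape(tables, expected_columns=5)` would drop the stray small tables.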
Sorry, I have been busy with my project these days. My solution is to get the location information for the entire PDF page and then filter the parsed data. Maybe my method only works for my case, but the problem has been solved. Thanks~
Has this been fixed now?
@pachacamac Opened it here https://github.com/camelot-dev/camelot/issues/50
@pachacamac Great. It really works.
When I try to read this PDF, I get the error below and I don't know how to solve it. Thanks. AreaPercent-lc-multPageTable1.pdf
D:\LES\Environment\Python\python.exe D:/LES/Python/Code/ReadNoBorderTable.py
Traceback (most recent call last):
File "D:/LES/Python/Code/ReadNoBorderTable.py", line 7, in
tables = camelot.read_pdf(pdfPath,flavor=tableType,strip_text=' .\n',columns=['58,107,139,189,327,258'],split_text=True)
File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\io.py", line 106, in read_pdf
layout_kwargs=layout_kwargs, **kwargs)
File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\handlers.py", line 162, in parse
layout_kwargs=layout_kwargs)
File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\parsers\stream.py", line 425, in extract_tables
cols, rows = self._generate_columns_and_rows(table_idx, tk)
File "C:\Users\suyongdeng.RD\AppData\Roaming\Python\Python37\site-packages\camelot\parsers\stream.py", line 321, in _generate_columns_and_rows
if self.columns is not None and self.columns[table_idx] != "":
IndexError: list index out of range
Process finished with exit code 1