chrisdev / pdftables

forked from the ScraperWiki pdftables (0.0.4) project, which was removed from GitHub

lowers.append(projection_threshold[0]) IndexError: list index out of range #3

Open mosesmc52 opened 9 years ago

mosesmc52 commented 9 years ago

Hi Chrisdev,

I'm using your pdftables library to extract and format tables from the following PDF: http://elibrary.ferc.gov/idmws/common/opennat.asp?fileID=13517611

However, I'm experiencing the following error while parsing page 13.

  File "libs/form6_parser.py", line 16, in read_file
    tables = get_tables(f)
  File "/Users/mosesmccall/.virtualenvs/form6/lib/python2.7/site-packages/pdftables/pdftables.py", line 85, in get_tables
    atomise=True)
  File "/Users/mosesmccall/.virtualenvs/form6/lib/python2.7/site-packages/pdftables/pdftables.py", line 529, in page_to_tables
    x_comb = comb_from_projection(column_projection, columnThreshold, "column")
  File "/Users/mosesmccall/.virtualenvs/form6/lib/python2.7/site-packages/pdftables/pdftables.py", line 244, in comb_from_projection
    lowers.append(projection_threshold[0])
IndexError: list index out of range

I suspect the error occurs because projection_threshold is an empty list, so there is no element at index 0, and there is no handling for an empty list.
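To illustrate the kind of guard this would need, here is a minimal sketch. It is not the library's actual code: only `lowers` and `projection_threshold` come from the traceback, the helper name `append_first_threshold` is hypothetical, and the rest of `comb_from_projection` is not shown in this thread, so whether skipping is the right behaviour there is an assumption.

```python
def append_first_threshold(lowers, projection_threshold):
    """Append the first element of projection_threshold to lowers, if there is one.

    Hypothetical helper: returns True when a value was appended and
    False when projection_threshold was empty (the case that raised
    IndexError in the traceback above).
    """
    if not projection_threshold:
        # Nothing crossed the threshold, so there is no index 0 to read.
        return False
    lowers.append(projection_threshold[0])
    return True
```

The caller could then decide what an empty result means, for example treating the page as having no detectable columns.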

chrisdev commented 9 years ago

Hi @mosesmc52, thanks for your contribution. I actually forked this from a ScraperWiki repo that was later removed from the web. I've switched to using Poppler (http://poppler.freedesktop.org) in my internal projects, even though it's not Python-based. At least it was reliable, and I hacked something together using subprocess calls.
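For anyone wanting to try the Poppler route, a rough sketch of a subprocess call might look like the following. It assumes Python 3.7+ and that Poppler's `pdftotext` command-line tool is installed and on the PATH. The comment above does not say which Poppler tool was used, so `pdftotext` and the helper names `build_pdftotext_cmd` and `page_text` are assumptions for illustration only.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page):
    """Build a pdftotext command line that extracts a single page.

    -layout keeps the physical layout of the text (useful for tables),
    -f and -l set the first and last page, and "-" as the output file
    sends the text to stdout.
    """
    return ["pdftotext", "-layout", "-f", str(page), "-l", str(page), pdf_path, "-"]


def page_text(pdf_path, page):
    """Run pdftotext and return the page text as a string.

    Raises subprocess.CalledProcessError if pdftotext exits with an error.
    """
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, page),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

This only returns layout-preserved text. Splitting it into table columns is still up to the caller.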

mosesmc52 commented 9 years ago

Hi Chris,

Thanks for posting this library. Yes, I found a few workarounds for errors such as the one above using exception handling. In general the library works well. My main problem was figuring out how to use the hint parameter to specify which table I wanted to extract within the PDF page.

Moses
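A sketch of what an exception-handling workaround could look like when the whole document is still needed: process pages one at a time and skip those that raise `IndexError`. The exact code used is not shown in this thread, so this is an assumption. `extract_all_pages` and the `extract_page` callable are hypothetical names, and `extract_page` stands in for whatever per-page call into pdftables is being used.

```python
def extract_all_pages(pages, extract_page):
    """Run extract_page on every page, skipping pages that raise IndexError.

    Returns (results, failed): results maps 1-based page numbers to
    whatever extract_page returned, and failed lists the page numbers
    that raised IndexError.
    """
    results = {}
    failed = []
    for number, page in enumerate(pages, start=1):
        try:
            results[number] = extract_page(page)
        except IndexError:
            # The failure reported in this issue: record the page and move on.
            failed.append(number)
    return results, failed
```

Keeping the list of failed pages makes it possible to retry just those pages with another tool instead of silently losing them.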

sprt commented 9 years ago

@mosesmc52 So the way you solved it was by extracting a specific table, correct? I'm looking for a workaround as well but I still need to extract the whole document.