allenai / mmda

multimodal document analysis
Apache License 2.0

pdfplumber parser: all_row_ids[-1] is out of range for an empty page (no words on the page) #191

Closed: egork520 closed this issue 1 year ago

egork520 commented 1 year ago

Hello @kyleclo, I found an issue: the parser references all_row_ids[-1] (and, a few lines later, all_word_ids[-1]) even when no words are detected on the page, which raises an IndexError. I could try to fix it by first checking whether the list is empty, but if you know a better fix, please let me know.

Here is the page screen shot:

Screen Shot 2023-01-06 at 10 53 07 AM

And the paper:

f87f9a26543e03c985867d0dbff1b900ecb6e46d.pdf

Here is the stack trace:

```
File ~/Documents/codes/git/ai2/s2/mmda/src/mmda/parsers/pdfplumber_parser.py:170, in PDFPlumberParser.parse(self, input_pdf_path)
    166 all_tokens.extend(fine_tokens)
    167 all_row_ids.extend(
    168     [i + last_row_id + 1 for i in line_ids_of_fine_tokens]
    169 )
--> 170 last_row_id = all_row_ids[-1]
    171 all_word_ids.extend(
    172     [i + last_word_id + 1 for i in word_ids_of_fine_tokens]
    173 )
    174 last_word_id = all_word_ids[-1]

IndexError: list index out of range
```
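
For illustration, the "check whether the list is empty first" idea could look roughly like the sketch below. This is not the code from the linked PR, and not mmda's actual implementation. `extend_with_offset` is a hypothetical helper, and the starting value of -1 for the running id is an assumption made so that the first id comes out as 0.

```python
from typing import List


def extend_with_offset(all_ids: List[int], page_ids: List[int], last_id: int) -> int:
    """Append page-local ids shifted past last_id and return the new last id.

    Hypothetical helper: mirrors the extend / [-1] pattern from the stack
    trace, but only reads all_ids[-1] when the page contributed ids.
    """
    all_ids.extend(i + last_id + 1 for i in page_ids)
    if page_ids:
        # Safe: all_ids is guaranteed to be non-empty here.
        last_id = all_ids[-1]
    # For an empty page the running id is left unchanged.
    return last_id


# Three pages of page-local row ids; the middle page has no words.
pages = [[0, 0, 1], [], [0, 1, 1]]

all_row_ids: List[int] = []
last_row_id = -1  # assumed starting value, so the first id is 0
for page_ids in pages:
    last_row_id = extend_with_offset(all_row_ids, page_ids, last_row_id)

print(all_row_ids)  # [0, 0, 1, 2, 3, 3]
```

The same guard would apply to the word ids. When the very first page is empty, the running id simply stays at its starting value and no IndexError is raised.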

egork520 commented 1 year ago

Link to the fix: PR