allenai / mmda

multimodal document analysis
Apache License 2.0

Document._annotate_box_group(self, box_groups, field_name) fails with IndexError: list index out of range #213

Closed: egork520 closed this issue 1 year ago

egork520 commented 1 year ago

See the Slack thread for the discussion.

Here is the PDF that fails: `s3://ai2-s2-pdfs/e824/7449ba86efa714e39f8918b750654fc6284e.pdf`

Stack trace:

```
Input In [90], in generate_mmda_figure_table_pdf(sha, docdict, display)
      9 else:
     10     recipe_doc = CoreRecipe()
---> 11 doc = recipe_doc.from_path(os.path.join(dir_name, name))
     13 doc_dict[name] = doc
     15 figure_table_pred = FigureTablePredictions(doc).predict()

File ~/Documents/codes/git/ai2/s2/mmda/src/mmda/recipes/core_recipe.py:54, in CoreRecipe.from_path(self, pdfpath)
     52 blocks = self.effdet_publaynet_predictor.predict(document=doc)
     53 equations = self.effdet_mfd_predictor.predict(document=doc)
---> 54 doc.annotate(blocks=blocks + equations)
     56 logger.info("Predicting vila...")
     57 vila_span_groups = self.vila_predictor.predict(document=doc)

File ~/Documents/codes/git/ai2/s2/mmda/src/mmda/types/document.py:96, in Document.annotate(self, is_overwrite, **kwargs)
     91     span_groups = self._annotate_span_group(
     92         span_groups=annotations, field_name=field_name
     93     )
     94 elif annotation_type == BoxGroup:
     95     # TODO: not good. BoxGroups should be stored on their own, not auto-generating SpanGroups.
---> 96     span_groups = self._annotate_box_group(
     97         box_groups=annotations, field_name=field_name
     98     )
     99 else:
    100     raise NotImplementedError(
    101         f"Unsupported annotation type {annotation_type} for {field_name}"
    102     )

File ~/Documents/codes/git/ai2/s2/mmda/src/mmda/types/document.py:175, in Document._annotate_box_group(self, box_groups, field_name)
    168 for box in box_group.boxes:
    169
    170     # Caching the page tokens to avoid duplicated search
    171     if box.page not in all_page_tokens:
    172         cur_page_tokens = all_page_tokens[box.page] = list(
    173             itertools.chain.from_iterable(
    174                 span_group.spans
--> 175                 for span_group in self.pages[box.page].tokens
    176             )
    177         )
    178     else:
    179         cur_page_tokens = all_page_tokens[box.page]

IndexError: list index out of range
```
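
For illustration only: the last frame fails while evaluating `self.pages[box.page]`, which suggests that a predicted box carries a page index beyond the pages the document actually has. The sketch below reproduces that situation with stand-in classes and shows one possible guard. `Box`, `Page`, and `collect_page_tokens` are hypothetical names invented for this example, not mmda's types or API, and this is not necessarily how the library resolved the issue.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Box:
    """Hypothetical stand-in for a predicted box; only the page index matters here."""
    page: int


@dataclass
class Page:
    """Hypothetical stand-in for a document page holding its tokens."""
    tokens: List[str] = field(default_factory=list)


def collect_page_tokens(
    pages: List[Page], boxes: List[Box]
) -> Tuple[Dict[int, List[str]], List[Box]]:
    """Cache tokens per page, skipping boxes whose page index is out of range."""
    all_page_tokens: Dict[int, List[str]] = {}
    skipped: List[Box] = []
    for box in boxes:
        # Guard: an unchecked pages[box.page] raises IndexError for a bad index.
        if not 0 <= box.page < len(pages):
            skipped.append(box)
            continue
        # Same caching idea as in the trace: look up each page only once.
        if box.page not in all_page_tokens:
            all_page_tokens[box.page] = list(pages[box.page].tokens)
    return all_page_tokens, skipped


if __name__ == "__main__":
    pages = [Page(tokens=["a", "b"])]   # a one-page document
    boxes = [Box(page=0), Box(page=1)]  # the second box points past the last page
    cached, skipped = collect_page_tokens(pages, boxes)
    print(cached, skipped)
```

Whether out-of-range boxes should be skipped, logged, or treated as a hard error is a design choice for the maintainers; the sketch only shows where a bounds check would sit.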

geli-gel commented 1 year ago

Duplicate of https://github.com/allenai/mmda/issues/206#issuecomment-1467212896