allenai / mmda

multimodal document analysis
Apache License 2.0

Kylel/2022 10/hotfix mention detection #162

Closed kyleclo closed 1 year ago

kyleclo commented 1 year ago

This should fix the issue where MentionPredictor was mutating fields in the document. See the code below; the final assertion failed prior to this PR.

As a consequence, doc.tokens could contain overly long tokens (e.g. token == 'Yang et al' instead of just token == 'Yang').

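# reproducing the error in citation mention detection
# (assumes parser, rasterizer, vision_predictor, vila_predictor and
# mention_predictor have already been constructed)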
pdfpath = '/Users/kylel/ai2/mmda/data/from_bailey/121e30c48546e671dc5e16c694c5e69b392cf8fb.pdf'

# parse PDF
doc = parser.parse(input_pdf_path=pdfpath)
tokens1 = [''.join(t.symbols) for t in doc.tokens]

# images
images = rasterizer.rasterize(input_pdf_path=pdfpath, dpi=72)
doc.annotate_images(images=images)
tokens2 = [''.join(t.symbols) for t in doc.tokens]

# boxes
box_groups = vision_predictor.predict(document=doc)
doc.annotate(blocks=box_groups)
tokens3 = [''.join(t.symbols) for t in doc.tokens]

# run vila
vila_preds = vila_predictor.predict(document=doc)
doc.annotate(vila_preds=vila_preds)
tokens4 = [''.join(t.symbols) for t in doc.tokens]

# detect citations
mention_preds = mention_predictor.predict(doc=doc)
doc.annotate(mention_preds=mention_preds)
tokens5 = [''.join(t.symbols) for t in doc.tokens]

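# the token text must be identical after every step;
# the last assertion is the one that failed before this PR
assert tokens1 == tokens2, '1 != 2'
assert tokens1 == tokens3, '1 != 3'
assert tokens1 == tokens4, '1 != 4'
assert tokens1 == tokens5, '1 != 5'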
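
For readers without the PDF or the models, here is a minimal standalone sketch of the same class of bug: a predictor that edits the caller's tokens in place, and one common guard against it (working on a deep copy). Everything in it is hypothetical and for illustration only: Token, token_texts, mutating_predict and safe_predict are not mmda classes or functions, and this is not necessarily how this PR fixes MentionPredictor.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """Hypothetical stand-in for a document token (not the mmda class)."""
    symbols: List[str] = field(default_factory=list)


def token_texts(tokens: List[Token]) -> List[str]:
    """Join each token's symbols into a string, as in the repro above."""
    return ["".join(t.symbols) for t in tokens]


def mutating_predict(tokens: List[Token]) -> str:
    """Buggy pattern: builds the mention text by extending the FIRST token in place."""
    first = tokens[0]
    for other in tokens[1:]:
        first.symbols.append(" ")
        first.symbols.extend(other.symbols)
    return "".join(first.symbols)


def safe_predict(tokens: List[Token]) -> str:
    """Same logic, but run on a deep copy so the caller's tokens stay untouched."""
    return mutating_predict(copy.deepcopy(tokens))


tokens = [Token(list("Yang")), Token(list("et")), Token(list("al"))]

print(safe_predict(tokens))      # mention text; input tokens are unchanged
print(token_texts(tokens))       # still three short tokens

print(mutating_predict(tokens))  # same mention text, but...
print(token_texts(tokens))       # ...the first token now holds the whole mention
```

Snapshotting the token text before and after a call, as the assertions above do, catches this kind of side effect regardless of which predictor causes it.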