allenai / papermage

library supporting NLP and CV research on scientific papers
https://papermage.org
Apache License 2.0

bugfix! figures missing? #73

Closed kyleclo closed 7 months ago

kyleclo commented 7 months ago

Hey y'all, sorry, looks like some bugs were introduced when migrating from our internal repo to the public one. This should resolve a lot of issues with Figures. Basically, certain entities like Figures don't have any spans associated with them; they only have boxes:

Entity(spans=[], boxes=[something here])

because they come from vision models.
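To make the failure mode concrete, here's a rough, self-contained sketch of why span-based intersection finds nothing for a box-only entity while box-based intersection still works. Everything here (`ToyEntity`, `Span`, `Box`, and the two helper functions) is a hypothetical stand-in written for illustration, not papermage's actual classes or logic:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    # Character offsets into the document text: [start, end)
    start: int
    end: int

    def overlaps(self, other: "Span") -> bool:
        return self.start < other.end and other.start < self.end


@dataclass
class Box:
    # Left, top, width, height (relative page coordinates) plus page index
    l: float
    t: float
    w: float
    h: float
    page: int

    def overlaps(self, other: "Box") -> bool:
        if self.page != other.page:
            return False
        return (
            self.l < other.l + other.w
            and other.l < self.l + self.w
            and self.t < other.t + other.h
            and other.t < self.t + self.h
        )


@dataclass
class ToyEntity:
    # Hypothetical stand-in, NOT papermage's Entity
    spans: List[Span] = field(default_factory=list)
    boxes: List[Box] = field(default_factory=list)


def intersects_by_span(a: ToyEntity, b: ToyEntity) -> bool:
    # With no spans on either side there is nothing to compare
    return any(s.overlaps(t) for s in a.spans for t in b.spans)


def intersects_by_box(a: ToyEntity, b: ToyEntity) -> bool:
    return any(p.overlaps(q) for p in a.boxes for q in b.boxes)


page = ToyEntity(spans=[Span(0, 500)], boxes=[Box(0.0, 0.0, 1.0, 1.0, page=0)])
figure = ToyEntity(spans=[], boxes=[Box(0.1, 0.2, 0.5, 0.3, page=0)])

print(intersects_by_span(page, figure))  # False: the figure has no spans
print(intersects_by_box(page, figure))   # True: the boxes overlap on page 0
```

So any lookup that silently goes through spans will return an empty result for figures, even though the figure clearly sits on the page.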

In this case, the ability to intersect across layers via attribute access (see Entity.__getattr__()) is broken, because it previously relied on being able to hit intersect_by_span. I've added a deprecation warning to any use of __getattr__ since it's ambiguous; I recommend all users call intersect_by_span or intersect_by_box explicitly in the future, which is clearer.
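For anyone curious what that pattern looks like, here's a minimal sketch of an ambiguous `__getattr__` that warns and falls back to span-based lookup, next to the explicit call. `ToyEntity` and its toy method bodies are assumptions for illustration only; the real implementation in papermage may differ:

```python
import warnings


class ToyEntity:
    """Hypothetical stand-in for an entity; not papermage's real class."""

    def __init__(self, layers):
        # Maps a layer name (e.g. "figures") to the entities in that layer
        self._layers = layers

    def intersect_by_span(self, name):
        # Toy behaviour: box-only layers have no spans, so nothing matches
        return []

    def intersect_by_box(self, name):
        # Toy behaviour: pretend every entity in the layer overlaps by box
        return list(self._layers.get(name, []))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        if name.startswith("_"):
            raise AttributeError(name)
        warnings.warn(
            f"Accessing '{name}' via attribute lookup is ambiguous; "
            "call intersect_by_span() or intersect_by_box() explicitly.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.intersect_by_span(name)


page = ToyEntity({"figures": ["fig-1"]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = page.figures  # goes through __getattr__, warns, finds nothing

explicit = page.intersect_by_box("figures")  # explicit and unambiguous

print(implicit)
print(explicit)
print(caught[0].category.__name__)
```

The implicit access quietly returns an empty list for a box-only layer, which is exactly the kind of surprise the deprecation warning is meant to flag.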

I've also added Figure detection to CoreRecipe properly; figures are derived from doc.blocks.
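Conceptually, deriving figures from blocks can be as simple as filtering layout blocks by their predicted type. The sketch below assumes each block carries a string label from the layout model; `ToyBlock`, its `label` field, the `"Figure"` label value, and `select_blocks` are all hypothetical names for illustration and may not match what CoreRecipe actually does:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyBlock:
    # Hypothetical layout block: a predicted label plus one box on one page
    label: str
    box: Tuple[float, float, float, float]  # left, top, width, height
    page: int


def select_blocks(blocks: List[ToyBlock], label: str) -> List[ToyBlock]:
    # Keep only the blocks whose predicted label matches
    return [block for block in blocks if block.label == label]


blocks = [
    ToyBlock("Text", (0.1, 0.1, 0.8, 0.2), page=5),
    ToyBlock("Figure", (0.1, 0.35, 0.8, 0.3), page=5),
    ToyBlock("Table", (0.1, 0.7, 0.8, 0.2), page=5),
]

figures = select_blocks(blocks, "Figure")
print(len(figures), figures[0].box)
```

Note the resulting figure entities carry boxes only, which is why the box-based intersection matters.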

Here's a minimal test to validate:

import pathlib

from papermage.magelib import Document
from papermage.recipes import CoreRecipe
from papermage.visualizers.visualizer import plot_entities_on_page

# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent.parent / "tests/fixtures/2305.14772.pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# visualize tokens
page_id = 0
plot_entities_on_page(page_image=doc.images[page_id], entities=doc.pages[page_id].tokens)

# visualize tables
page_id = 5
tables = doc.pages[page_id].intersect_by_box("tables")
plot_entities_on_page(page_image=doc.images[page_id], entities=tables)
for table in tables:
    print(table.text)

# visualize figures
figures = doc.pages[page_id].intersect_by_box("figures")
for figure in figures:
    print(figure.text)
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)

# visualize blocks
blocks = doc.pages[page_id].intersect_by_box("blocks")
for block in blocks:
    print(block.text)
plot_entities_on_page(page_image=doc.images[page_id], entities=blocks)

Here's an example of the figure visualization: [image]

Thanks @aakanksha19 for catching!