Hey y'all, sorry, it looks like some bugs were introduced when migrating from our internal repo to the public one. This should resolve a lot of the issues with Figures. Basically, certain entities like Figures don't have any spans associated with them; they only have boxes:
Entity(spans=[], boxes=[something here])
because they come from vision models.
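To illustrate why box-only entities break span-based intersection, here is a minimal sketch. It does not use papermage itself: ToySpan, ToyBox, ToyEntity, and the two helper functions are hypothetical stand-ins, and the coordinates are made up. The point is only that an entity with an empty span list can never match by span, while it can still match by box.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySpan:
    """Hypothetical character span [start, end)."""
    start: int
    end: int

    def overlaps(self, other: "ToySpan") -> bool:
        return self.start < other.end and other.start < self.end


@dataclass
class ToyBox:
    """Hypothetical box: left, top, width, height, page index."""
    l: float
    t: float
    w: float
    h: float
    page: int

    def overlaps(self, other: "ToyBox") -> bool:
        if self.page != other.page:
            return False
        return (
            self.l < other.l + other.w
            and other.l < self.l + self.w
            and self.t < other.t + other.h
            and other.t < self.t + self.h
        )


@dataclass
class ToyEntity:
    """Hypothetical entity holding spans and/or boxes."""
    spans: List[ToySpan] = field(default_factory=list)
    boxes: List[ToyBox] = field(default_factory=list)


def intersects_by_span(a: ToyEntity, b: ToyEntity) -> bool:
    # any() over an empty product is False, so a span-less entity never matches
    return any(s.overlaps(t) for s in a.spans for t in b.spans)


def intersects_by_box(a: ToyEntity, b: ToyEntity) -> bool:
    return any(x.overlaps(y) for x in a.boxes for y in b.boxes)


# A page with text and a full-page box; a figure with boxes only (made-up values).
page = ToyEntity(spans=[ToySpan(0, 1000)], boxes=[ToyBox(0.0, 0.0, 1.0, 1.0, page=5)])
figure = ToyEntity(spans=[], boxes=[ToyBox(0.1, 0.1, 0.5, 0.3, page=5)])

span_hit = intersects_by_span(page, figure)
box_hit = intersects_by_box(page, figure)
```

In this toy setup the span check finds nothing to compare, which mirrors why a lookup that silently falls back to spans comes up empty for Figures.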
In this case, the ability to intersect across layers via attribute-style access (see Entity.__getattr__()) is broken because it previously relied on being able to hit intersect_by_span. I've added a deprecation warning to any use of Entity.__getattr__() since it's ambiguous; I recommend that all users call intersect_by_span or intersect_by_box explicitly in the future, which is clearer.
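For readers unfamiliar with the pattern, a deprecation warning on attribute-style access could look roughly like the sketch below. This is not the papermage implementation: ToyLayeredEntity, its _layers dict, and the warning text are all hypothetical, and the real Entity.__getattr__() may differ in its details.

```python
import warnings


class ToyLayeredEntity:
    """Hypothetical entity that resolves layer names via attribute access."""

    def __init__(self, layers):
        self._layers = layers  # e.g. {"tokens": [...]} (made-up structure)

    def intersect_by_span(self, name):
        # Stand-in for the explicit lookup; here it just returns the stored layer.
        return self._layers[name]

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        # Read _layers through __dict__ to avoid recursing into __getattr__.
        layers = self.__dict__.get("_layers", {})
        if name in layers:
            warnings.warn(
                f"Attribute-style access '.{name}' is ambiguous and deprecated; "
                "call intersect_by_span or intersect_by_box explicitly.",
                DeprecationWarning,
                stacklevel=2,
            )
            return self.intersect_by_span(name)
        raise AttributeError(name)


toy_page = ToyLayeredEntity({"tokens": ["Hello", "world"]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    tokens = toy_page.tokens  # still works, but warns
```

The old call path keeps working, so existing code does not break, while the warning nudges callers toward the explicit method.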
I've also added Figure detection to the CoreRecipe properly; figures are derived from doc.blocks.
Here's a minimal test to validate:
import pathlib
from papermage.magelib import Document
from papermage.recipes import CoreRecipe
from papermage.visualizers.visualizer import plot_entities_on_page
# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent.parent / "tests/fixtures/2305.14772.pdf"
doc = recipe.from_pdf(pdf=pdfpath)
# visualize tokens
page_id = 0
plot_entities_on_page(page_image=doc.images[page_id], entities=doc.pages[page_id].tokens)
# visualize tables
page_id = 5
tables = doc.pages[page_id].intersect_by_box("tables")
plot_entities_on_page(page_image=doc.images[page_id], entities=tables)
for table in tables:
    print(table.text)
# visualize figures
figures = doc.pages[page_id].intersect_by_box("figures")
for figure in figures:
    print(figure.text)
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)
# visualize blocks
blocks = doc.pages[page_id].intersect_by_box("blocks")
for block in blocks:
    print(block.text)
plot_entities_on_page(page_image=doc.images[page_id], entities=blocks)
Here's an example of the figure visualization:
Thanks @aakanksha19 for catching!