allenai / papermage

library supporting NLP and CV research on scientific papers
https://papermage.org
Apache License 2.0

how to extract figures from the pdf? #70

Closed PeterGriffinJin closed 5 months ago

PeterGriffinJin commented 7 months ago

Hi there,

Thank you so much for the nice package!

Can I ask how to extract the figures from the pdf? I have tried:

recipe = CoreRecipe()
doc = recipe.run("papermage/tests/fixtures/2020.acl-main.447.pdf")
doc.figures

But it seems that this is not returning the figure data. Is the figure extraction achievable with your package?

Best, Bowen

kyleclo commented 5 months ago

Hey @PeterGriffinJin, sorry, this looks like a bug. Once this PR merges, it should be fixed. Thanks! https://github.com/allenai/papermage/pull/73

kyleclo commented 5 months ago

Just merged https://github.com/allenai/papermage/pull/73. Here's me testing out the recipe locally on that PDF to get Figures:

import pathlib

from papermage.recipes import CoreRecipe
from papermage.visualizers.visualizer import plot_entities_on_page

# Run the core recipe on the test fixture PDF (path is relative to this script's location).
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent.parent / "tests/fixtures/2020.acl-main.447.pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# Collect the figure entities on the first page and overlay them on the page image.
page_id = 0
figures = doc.pages[page_id].intersect_by_box("figures")
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)
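
To save a figure region rather than only plot it, its box would need to be converted to pixel coordinates. Here is a minimal sketch of that arithmetic, assuming each figure box is stored as page-relative (l, t, w, h) fractions between 0 and 1. `RelBox` and `to_pixel_crop` are hypothetical helpers written for illustration and are not part of papermage.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RelBox:
    """Hypothetical stand-in for a box in page-relative coordinates (0..1)."""
    l: float  # left edge as a fraction of page width
    t: float  # top edge as a fraction of page height
    w: float  # width as a fraction of page width
    h: float  # height as a fraction of page height


def to_pixel_crop(box: RelBox, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Return (left, upper, right, lower) pixel bounds, clamped to the page."""
    left = max(0, round(box.l * page_width))
    upper = max(0, round(box.t * page_height))
    right = min(page_width, round((box.l + box.w) * page_width))
    lower = min(page_height, round((box.t + box.h) * page_height))
    return left, upper, right, lower


# Example: a box covering the middle band of a 1000x800 pixel page image.
print(to_pixel_crop(RelBox(l=0.1, t=0.2, w=0.5, h=0.25), page_width=1000, page_height=800))
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be passed there once the page image is available as a PIL image.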

(two images attached)

kyleclo commented 5 months ago

I'm gonna close this for now. Please re-open if it's not resolved, thanks!