allenai / papermage

library supporting NLP and CV research on scientific papers
https://papermage.org
Apache License 2.0
692 stars, 54 forks

How to get the page number of each figure? #75

Open LiyingCheng95 opened 7 months ago

LiyingCheng95 commented 7 months ago

I want to crop all the figures, images, and tables in a PDF. Can I get the page number of each figure in doc.figures[x]?

kyleclo commented 7 months ago

hi @LiyingCheng95

please check out this example snippet in https://github.com/allenai/papermage/issues/63

import json
import os
import pathlib

from papermage.magelib import Document
from papermage.recipes import CoreRecipe
from papermage.visualizers.visualizer import plot_entities_on_page

# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent.parent / "tests/fixtures/2020.acl-main.447.pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# visualize figures on a page
page_id = 0
figures = doc.pages[page_id].intersect_by_box("figures")
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)

# get the image of a page and its dimensions
page_image = doc.images[page_id]
page_w, page_h = page_image.pilimage.size

# get the bounding box of a figure
figure_box = figures[0].boxes[0]

# convert it
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the image using PIL
page_image._pilimage.crop(figure_box_xy)
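
The conversion step in the snippet above can be illustrated without papermage. This is a minimal sketch, assuming the box stores a relative left, top, width, and height in the range 0 to 1 and that the crop box for PIL is (left, upper, right, lower) in pixels. The function name relative_box_to_pixels is hypothetical and is not part of the papermage API.

```python
from typing import Tuple


def relative_box_to_pixels(
    l: float, t: float, w: float, h: float, page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Convert a relative (left, top, width, height) box to pixel corners.

    Hypothetical helper for illustration; it is not papermage's implementation.
    """
    x1 = round(l * page_width)
    y1 = round(t * page_height)
    x2 = round((l + w) * page_width)
    y2 = round((t + h) * page_height)
    return x1, y1, x2, y2


# Example values are made up: a box in the lower-left area of an 800x1000 page image.
crop_box = relative_box_to_pixels(0.25, 0.5, 0.5, 0.25, page_width=800, page_height=1000)
print(crop_box)
```

The resulting 4-tuple has the shape that PIL's Image.crop expects.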

LiyingCheng95 commented 7 months ago

Thanks for your prompt reply. However, it doesn't work in my case. For example, there is a figure on page 8 of my PDF file. The code below crops that figure correctly, but it requires me to specify the page of each detected figure myself.

recipe = CoreRecipe()
doc = recipe.run("path to my pdf")

# get the image of a page and its dimensions
page_image = doc.images[8]
page_w, page_h = page_image.pilimage.size

# get the bounding box of a figure
figure_box = doc.figures[0].boxes[0]

# convert it
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the image using PIL
cropped_image = page_image._pilimage.crop(figure_box_xy)

cropped_image.save('cropped_image.jpg')

But when I ran the code below, it raised this error at the line figure_box = figures[0].boxes[0]: "IndexError: list index out of range"

# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent / "path to my pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# visualize figures on a page
page_id = 8
figures = doc.pages[page_id].intersect_by_box("figures")
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)

# get the image of a page and its dimensions
page_image = doc.images[page_id]
page_w, page_h = page_image.pilimage.size

# get the bounding box of a figure
figure_box = figures[0].boxes[0]

# convert it
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the image using PIL
cropped_image = page_image._pilimage.crop(figure_box_xy)
cropped_image.save('cropped_image.jpg')

I'm not sure what's wrong here.

kyleclo commented 7 months ago

Do you mind emailing the PDF file?

kyleclo commented 7 months ago

Thanks @LiyingCheng95 this is definitely a bug; I'm looking into patching it!

First, it seems like the figure is actually being detected correctly. For example:

recipe = CoreRecipe()
doc = recipe.from_pdf(pdf='your-file.pdf')

# asserts there are definitely figures on page 8
figures = [figure for figure in doc.figures if figure.boxes[0].page == 8]
assert len(figures) > 0
print(f"{figures[0].boxes}")

> [Box[0.12299907267594538, 0.05627375260667959, 0.731138803177521, 0.19940743706854958, 8]]

# i can visualize that figure on page 8
plot_entities_on_page(page_image=doc.images[8], entities=figures)

So I looked into where the bug is coming from. It seems that this cross-layer indexing operation is not finding a match:

figures[0].intersect_by_box("pages")
> []

doc.pages[8].intersect_by_box("figures")
> []

This is super weird because the boxes definitely overlap:

doc.pages[8].boxes[0]
> Box[0.027564877832563207, 0.2701246785544094, 0.943916833476601, 0.523800017428919, 8]

figures[0].boxes[0]
> Box[0.12299907267594538, 0.05627375260667959, 0.731138803177521, 0.19940743706854958, 8]

So I checked and it looks like there's a bug in my Box.is_overlap logic:

figures[0].boxes[0].is_overlap(doc.pages[8].boxes[0])
> False

I'll work on fixing this.
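
For reference, a standard axis-aligned overlap test can be sketched independently of the library. This is only a sketch, assuming the five printed values are left, top, width, height, and page in relative coordinates. The RelBox class and boxes_overlap function are hypothetical stand-ins, not papermage's Box or its is_overlap method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelBox:
    """Hypothetical stand-in for a box, assumed to be (left, top, width, height, page)."""
    l: float
    t: float
    w: float
    h: float
    page: int


def boxes_overlap(a: RelBox, b: RelBox) -> bool:
    """Return True if two boxes on the same page share any area."""
    if a.page != b.page:
        return False
    horizontal = a.l < b.l + b.w and b.l < a.l + a.w
    vertical = a.t < b.t + b.h and b.t < a.t + a.h
    return horizontal and vertical


# The two boxes as printed in the thread, read under the assumption above.
page_box = RelBox(0.027564877832563207, 0.2701246785544094, 0.943916833476601, 0.523800017428919, 8)
figure_box = RelBox(0.12299907267594538, 0.05627375260667959, 0.731138803177521, 0.19940743706854958, 8)

print(boxes_overlap(figure_box, page_box))
# A box covering the full page, for comparison.
print(boxes_overlap(figure_box, RelBox(0.0, 0.0, 1.0, 1.0, 8)))
```

Under that reading, the sketch also reports no overlap for the two printed boxes: the figure's vertical extent ends at about 0.256, above where the page box begins at about 0.270. It may therefore be worth checking whether the page box spans the whole page, since a full-page box would overlap the figure.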

In the meantime, you should be able to grab all the figures using doc.figures. To find the figures on a given page, filter on the page of each figure's first box: [figure for figure in doc.figures if figure.boxes[0].page == page_id], where page_id is the page you want.
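
The workaround above can be extended to index every figure by page in one pass. This is a minimal sketch, assuming each figure exposes a boxes list whose items carry a page attribute, as in the snippets in this thread. The helper group_figures_by_page is hypothetical, and the SimpleNamespace objects are stand-ins so the example runs without papermage or a PDF.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_figures_by_page(figures):
    """Map page index -> list of figures, keyed by the page of each figure's first box."""
    by_page = defaultdict(list)
    for figure in figures:
        if not figure.boxes:  # skip figures that have no boxes at all
            continue
        by_page[figure.boxes[0].page].append(figure)
    return dict(by_page)


def make_fake_figure(page):
    """Stand-in object that mimics only the attributes this sketch relies on."""
    return SimpleNamespace(boxes=[SimpleNamespace(page=page)])


# With a real document this would be group_figures_by_page(doc.figures).
fake_figures = [make_fake_figure(2), make_fake_figure(8), make_fake_figure(8)]
figures_by_page = group_figures_by_page(fake_figures)
print(sorted(figures_by_page))
```

Each page index can then be used to pick the matching entry of doc.images before cropping, which avoids hard-coding the page number for every figure.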

xsank commented 2 months ago

You could use the layout parser directly to parse figures page by page.

sssyaDavid commented 1 day ago

Has this problem been solved? I think I'm running into the same issue.