allenai / papermage

library supporting NLP and CV research on scientific papers
https://papermage.org
Apache License 2.0

recipe.run("tests/fixtures/papermage.pdf") process keeps being killed. OOM? #79

Open josephj1o4e1 opened 5 months ago

josephj1o4e1 commented 5 months ago

My setup (Windows 10 + WSL) can't run:

doc = recipe.run("tests/fixtures/papermage.pdf")

It seems to download all 13 pages to 100% normally, but then the process is killed when run as a Python script, or it crashes VS Code when run in quick_start.ipynb. It works for test-uu.pdf, since that only has 1 page, but it crashes every time on the 13-page papermage.pdf. This seems to be related to running out of memory?

I'm not sure whether this can be resolved. Is there a memory requirement for this package that I should be aware of? I don't think 13 pages is that many for most PDFs. Is there a workaround, such as processing the PDF one page at a time? I'm worried that wouldn't be a good approach, though, since it couldn't fully use features such as doc.pages.
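
For reference, the page-by-page idea raised above could be sketched roughly as follows. This is only an illustration, not something the papermage maintainers recommend: it assumes the third-party pypdf package is installed, and `chunk_ranges` and `split_pdf` are hypothetical helper names, not part of papermage.

```python
from pathlib import Path
from typing import List, Tuple


def chunk_ranges(n_pages: int, chunk_size: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) page ranges that cover n_pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(s, min(s + chunk_size, n_pages)) for s in range(0, n_pages, chunk_size)]


def split_pdf(pdf_path: str, out_dir: str, chunk_size: int = 1) -> List[Path]:
    """Write each page range of pdf_path to its own PDF and return the new paths."""
    # Assumption: pypdf is available. It is imported lazily so that
    # chunk_ranges can still be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for i in range(start, stop):
            writer.add_page(reader.pages[i])
        # File names use 1-based page numbers, e.g. papermage_p1-1.pdf
        chunk_path = out / f"{Path(pdf_path).stem}_p{start + 1}-{stop}.pdf"
        with open(chunk_path, "wb") as fh:
            writer.write(fh)
        paths.append(chunk_path)
    return paths
```

Each path returned by `split_pdf` could then be passed to `recipe.run(str(path))` in turn. As the question already anticipates, every resulting document would only know about its own chunk, so `doc.pages` would not span the whole paper, and whether this lowers peak memory enough to avoid the crash is untested.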

kyleclo commented 5 months ago

Sorry, I'm not familiar with what's required to run on Windows 10 + WSL. Do you want to give this a try:

from papermage.recipes import MinimalTextOnlyRecipe

recipe = MinimalTextOnlyRecipe()
doc = recipe.run("tests/fixtures/1903.10676.pdf")

This is a single-page PDF, and the recipe is very, very minimal.

josephj1o4e1 commented 4 months ago

Yes, I had already tested "test-uu.pdf" as a single-page case and it worked. It worked for the PDF you suggested as well.

May I ask whether there's a typical setup (RAM/OS) for using this package? I'm a Windows user and hit the OSError problem reported in previous issues. I switched to WSL, where it works, but only for PDFs with few pages.

Thanks.
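
When comparing setups, it can help to know how much memory the Linux side of WSL actually sees. A minimal sketch, assuming a Linux environment where /proc/meminfo is readable (the case inside WSL); `parse_meminfo` and `available_mb` are hypothetical helper names:

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_mb(path: str = "/proc/meminfo") -> float:
    """Return the kernel's MemAvailable estimate in MB."""
    with open(path) as fh:
        info = parse_meminfo(fh.read())
    return info["MemAvailable"] / 1024
```

Calling `available_mb()` just before `recipe.run(...)` gives a rough idea of the headroom the process has; it says nothing about how much the recipe itself will need.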

kyleclo commented 4 months ago

Ahh, unfortunately I can test this on macOS Ventura (M1 MacBook), but I don't have the ability to test it on Windows; this may be something you'll have to work out.

As for memory, profiling on my M1, I'm seeing that CoreRecipe requires 2.2 GB for a single-page PDF and 2.4 GB for a 12-page PDF, while MinimalTextOnlyRecipe requires 290 MB for a single-page PDF and 400 MB for a 12-page PDF.
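
The thread does not say how these figures were measured. One way to get a comparable number on another Unix-like machine (including WSL) is to read the process's peak resident set size from the standard library after the run. This is a sketch of that approach, not the profiling method used above, and `peak_rss_mb` and `run_and_report` are hypothetical helper names:

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of the current process so far, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, peak RSS in MB afterwards)."""
    result = fn(*args, **kwargs)
    return result, peak_rss_mb()
```

For example, `doc, peak = run_and_report(recipe.run, "tests/fixtures/papermage.pdf")` would report the peak for the whole process, including model loading. If the process is killed partway through, nothing is returned, so this only helps on runs that complete.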

Hope that helps?