VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0
14.15k stars, 720 forks

marker_single bbox detection crash on non-simple PDFs #127

Open bjpcjp opened 1 month ago

bjpcjp commented 1 month ago

about:

history:

$marker_single

$marker_single

$marker_single

file1 is arXiv 2401.14295v1 (topologies of reasoning).
file2 is a chapter from a book on game theory, with lots of images.
file3 is a simple HTML-to-PDF glossary doc: no images, just a list of terms and definitions.

VikParuchuri commented 1 month ago

Try again after updating the package. I fixed a memory leak after you posted this.

bjpcjp commented 1 month ago

TY @VikParuchuri!

This time file1 (arXiv 2401.14295v1) made it through the first bbox detection loop (5/5 successful), but it crashed on the second bbox detection loop (0/4).

I'm using marker-pdf v0.2.6. There are some dependency conflicts that need to be sorted out:

langchain-core 0.1.48 --> packaging<24.0,>=23.2; 24.0 installed.
mkdocs 1.4.2 --> markdown<3.4,>=3.2.1; 3.4.4 installed.
torchvision 0.16.1 --> torch 2.1.1; 2.3.0 installed.
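
To make the three conflicts above concrete, here is a minimal sketch that checks each installed version against the stated requirement. It is a simplified, stdlib-only comparison for plain dotted numeric versions, not a full PEP 440 implementation, and the names `parse_version`, `satisfies`, and `REPORTED` are made up for illustration. It also assumes the torchvision line means an exact pin (`==2.1.1`), which the report does not spell out. In practice, `pip check` reports conflicts of this kind for the whole environment.

```python
import operator

# Comparison operators supported by this simplified checker.
OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_version(text):
    """Turn '3.4.4' into (3, 4, 4). Handles dotted numeric versions only."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(installed, spec):
    """Return True if `installed` meets every comma-separated clause in `spec`."""
    have = parse_version(installed)
    for clause in spec.split(","):
        clause = clause.strip()
        # Two-character operators are listed first in OPS, so '<=' wins over '<'.
        for symbol, compare in OPS.items():
            if clause.startswith(symbol):
                want = parse_version(clause[len(symbol):])
                # Pad with zeros so that '24.0' and '24' compare as equal.
                width = max(len(have), len(want))
                left = have + (0,) * (width - len(have))
                right = want + (0,) * (width - len(want))
                if not compare(left, right):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# (required by, package, requirement, installed version), copied from the report.
# The '==' on the torch requirement is an assumption.
REPORTED = [
    ("langchain-core 0.1.48", "packaging", "<24.0,>=23.2", "24.0"),
    ("mkdocs 1.4.2", "markdown", "<3.4,>=3.2.1", "3.4.4"),
    ("torchvision 0.16.1", "torch", "==2.1.1", "2.3.0"),
]

conflicts = [row for row in REPORTED if not satisfies(row[3], row[2])]
for required_by, package, spec, installed in conflicts:
    print(f"{required_by} needs {package}{spec}, but {installed} is installed")
```

All three rows fail their requirement, which matches the report: each installed version is newer than its upper bound or pin allows.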

VikParuchuri commented 1 month ago

If you can share the files, it would help me debug. Langchain and mkdocs aren't marker dependencies; installing marker in a virtualenv might help with isolating other dependencies.
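
One way to follow the virtualenv suggestion, sketched with Python's standard `venv` module. The helper name `create_isolated_env` and the environment path are hypothetical. The function only creates the environment and returns the pip command that would install `marker-pdf` into it; it does not run the install itself.

```python
import os
import venv
from pathlib import Path


def create_isolated_env(env_dir, with_pip=True):
    """Create a virtual environment and return a pip install command for marker-pdf."""
    env_dir = Path(env_dir)
    # with_pip=True bootstraps pip inside the new environment.
    venv.create(env_dir, with_pip=with_pip)
    # The interpreter lives in Scripts\ on Windows and bin/ elsewhere.
    if os.name == "nt":
        python = env_dir / "Scripts" / "python.exe"
    else:
        python = env_dir / "bin" / "python"
    # Command that installs (or upgrades) marker-pdf using the environment's own pip.
    return [str(python), "-m", "pip", "install", "--upgrade", "marker-pdf"]
```

Passing the returned list to `subprocess.run(cmd, check=True)` would perform the install, which needs network access. Because the environment starts empty, packages such as langchain-core or mkdocs from the global site-packages are not visible inside it.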

bjpcjp commented 1 month ago

Demystifying_the_Topologies_Behind_prompting_1706394504.pdf