Open Lastofthefirst opened 1 month ago
Hi! I'm also impressed by Marker version 2.5! After testing LlamaParse, GROBID, Nougat, and, a long time ago, Textract, it looks to me like the best PDF parser currently available! It does a much better job of identifying tables than Nougat, I haven't found any left-out pages yet, formulas are usually identified well, it captures most figures, and it already runs stably and relatively fast.
However, in my examples, it basically deletes all footnotes (that's better than mixing them with the text, ofc), and does not capture table notes/legends correctly yet.
Attached an example paper and the MD result.
Please continue to develop it - great work!
Earnings_Prediction_Using_Recurrent_Neural_Networks.md Earnings_Prediction_Using_Recurrent_Neural_Networks.pdf
@relsas How did you test it? My test ran very slowly. I tested it on a Tesla T4 GPU.
@ Have you tested https://github.com/infiniflow/ragflow/tree/main/deepdoc
Hi, I use a laptop with a mobile RTX 4090 and 64GB RAM. I guess "slow" is relative: parsing a paper like the uploaded one takes about one minute. As the GPU is barely used, I figure that batch processing will speed it up further. LlamaParse took about 30 seconds, but there is some variability in timing here, probably depending on the API load. Nougat-base took about 1:30 minutes with much more GPU load (and no extracted figures).
v2 is a huge improvement, well done!
When I went to update, on the first run I got the error "no module named surya". I tried pip install surya, with no luck, but the surya repo says pip install surya-ocr, and that worked. The same thing happened with pdftext. Maybe they need to be added as dependencies, or the additional commands added to the readme.
Here is an example paper I was trying to convert from PDF to Markdown:
33.3+Smith.pdf
Here is what the previous version generated:
33.3+Smith.md
And here is what I got with v2:
33.3+Smith.md
I used the command: marker_single /Downloads/33.3+smith.pdf /Downloads --batch_multiplier 2 --langs English
Really a huge improvement. It seems like the section heading font causes an issue in both cases. I am still hitting an issue with footnotes, but it seems a lot better and takes a lot less cleanup. There is also something strange where certain words have a space in them, where in v1 they had a strange symbol. Take for example the word "scientific" (to find it with Ctrl-F you have to search for "scienti"). Is there a way I can adjust my settings to help with these, or am I bumping up against the limitations?
Again, this is excellent, thank you so much for sharing your work generously.
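For the broken "scientific" case, one possible cleanup step on the generated Markdown is sketched below. It assumes the strange symbol seen in v1 is a typographic ligature character such as "ﬁ" (U+FB01), which is a guess based on the word affected and not something confirmed in this thread. The function name fix_ligatures is hypothetical and is not part of Marker.

```python
# Minimal sketch: expand Latin ligature characters into plain letters.
# Assumption: the odd symbol inside words like "scientific" is a ligature
# code point (U+FB00..U+FB06). fix_ligatures is a hypothetical helper name.

LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t
    "\ufb06": "st",
})


def fix_ligatures(text: str) -> str:
    """Return text with Latin ligature characters replaced by plain letters."""
    return text.translate(LIGATURES)


if __name__ == "__main__":
    print(fix_ligatures("scienti\ufb01c"))  # scientific
```

This only helps when the ligature character is still present in the output. If the glyph has already been dropped and replaced by a space (as in the v2 output, "scienti c"), the missing letters cannot be recovered this way. The mapping is kept explicit rather than using a blanket Unicode normalization so that other characters in the paper, such as superscripts in formulas, are left alone.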