VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0
16.82k stars, 955 forks

Problem detecting columns / VRAM Requirements GPU acceleration #110

Closed by tilllt 5 months ago

tilllt commented 5 months ago

Hey there,

I am trying to convert the election position papers of various parties for the upcoming European elections so that I can process them further in an LLM for an educational project.

After several unsuccessful attempts using ChatGPT and other tools to convert the PDF documents into text only - or markdown - I found your tool and the result is pretty good.

Unfortunately it gets confused by the column reading order occasionally, so paragraphs pop up in completely unrelated sections of the document. Is there anything I can do to optimize the process? The document is really too long to detect and fix all errors manually...

This is the example I am talking about: https://cms.gruene.de/uploads/assets/20240306_Reader_EU-Wahlprogramm2024_A4.pdf

Thanks

VikParuchuri commented 5 months ago

I'm working on a new version (still WIP) that should fix this. Try the dev branch and see how it does.

VikParuchuri commented 5 months ago

Should be fixed by this - https://github.com/VikParuchuri/marker/pull/116

tilllt commented 4 months ago

I tried the dev version after you mentioned it, and it went through the linked example without errors. Thanks for the great work.

Additionally, I experimented with the CPU / CUDA torch settings, but it seems my ancient GTX 1060 with 6 GB of VRAM cannot be used to accelerate your tool. Torch kept complaining about a lack of memory. The models marker needs appear to add up to a 5.6 GB VRAM requirement, which is apparently more than my GTX 1060 can provide in practice.
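For anyone hitting the same limit, here is a minimal sketch of checking free VRAM before choosing between CUDA and CPU. It assumes PyTorch is installed and uses `torch.cuda.mem_get_info()`. The 5.6 GB figure is the one reported above (treated here as GiB), and the 1 GB headroom is an arbitrary assumption, not a number from marker. The function names are hypothetical.

```python
from typing import Optional

GIB = 1024 ** 3


def pick_device(free_bytes: Optional[int],
                required_gb: float = 5.6,   # figure reported in this thread
                headroom_gb: float = 1.0) -> str:  # assumed safety margin
    """Return "cuda" only if free VRAM covers the models plus headroom."""
    if free_bytes is None:
        return "cpu"
    needed = (required_gb + headroom_gb) * GIB
    return "cuda" if free_bytes >= needed else "cpu"


def free_vram_bytes() -> Optional[int]:
    """Free VRAM on the current CUDA device, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free, _total = torch.cuda.mem_get_info()  # (free, total) in bytes
    return free


if __name__ == "__main__":
    print(pick_device(free_vram_bytes()))
```

On a 6 GB card the free amount is usually below the nominal 6 GB, because the driver and any desktop session hold some memory. That would be consistent with the out-of-memory complaints described here.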

I went back to CPU processing and later ran into problems with some documents: the analysis started, got to about 75%, and then ended abruptly before the document was finished. There was no error message explaining why processing stopped.

As I said, that was still in the dev branch, I will try your new v2 version to see if that works better today and report back.

VikParuchuri commented 4 months ago

Let me know if you see the "things silently failing" issue again - and please send the PDFs if possible. I think there is a memory leak with certain kinds of PDFs, but haven't been able to track it down. OOM (out of memory) would match what you're describing.
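One way to check whether a silent stop is really an out-of-memory kill is to run the conversion as a child process and inspect its exit status. The sketch below assumes a POSIX system, where `subprocess` reports a negative return code when the child was terminated by a signal (the Linux OOM killer sends SIGKILL). The helper names are hypothetical and the command is whatever you normally run.

```python
import signal
import subprocess
import sys
from typing import List


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"failed with exit code {returncode}"


def run_and_report(cmd: List[str]) -> str:
    """Run cmd to completion and describe how it ended."""
    result = subprocess.run(cmd)
    return describe_exit(result.returncode)


if __name__ == "__main__":
    # Stand-in command; replace it with the actual conversion command.
    print(run_and_report([sys.executable, "-c", "print('converting...')"]))
```

A result of "killed by SIGKILL" with no Python traceback would point toward the kernel ending the process, which can then be confirmed in the system log.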

tilllt commented 4 months ago

> Let me know if you see the "things silently failing" issue again

Unfortunately, I did see it again after switching from dev to the main v2 branch, with CUDA disabled and processing back on the CPU.

The document I was running on was this:

https://voltdeutschland.org/storage/assets-de/pdf/europawahl_2024/volt-wahlprogramm-europawahl-2024.pdf

VikParuchuri commented 4 months ago

I think I fixed this, but it needs to be merged and tested end to end - https://github.com/VikParuchuri/surya/pull/103