VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0
17.36k stars, 991 forks

Too much memory used for a big PDF: 800 pages cost 80 GB of RAM. #269

Closed: whp98 closed this 1 week ago

whp98 commented 2 months ago

[Screenshots: 2024-08-25_18-59, 2024-08-25_18-56]

The PDF file is here: https://github.com/skyformat99/books-1/blob/master/%E8%AE%A1%E7%AE%97%E6%9C%BA%E2%97%8F%E7%BC%96%E7%A8%8B%E8%AF%AD%E8%A8%80%E2%97%8FJAVA/Java%E7%BC%96%E7%A8%8B%E6%80%9D%E6%83%B3%EF%BC%88%E7%AC%AC4%E7%89%88%EF%BC%89.pdf

whp98 commented 2 months ago

Sometimes it also fails with a CUDA OOM error. My GPU is a 4060 Ti 16 GB.

The PDF is this one: https://github.com/yuanliangding/books/blob/master/%E8%AE%A1%E7%AE%97%E6%9C%BA-%E7%BC%96%E7%A8%8B%E8%AF%AD%E8%A8%80-JAVA/Java%E5%B9%B6%E5%8F%91%E7%BC%96%E7%A8%8B%E5%AE%9E%E6%88%98.pdf

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 15.60 GiB of which 747.88 MiB is free. Including non-PyTorch memory, this process has 2.46 GiB memory in use. Of the allocated memory 2.14 GiB is allocated by PyTorch, and 166.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Error converting PDF to Markdown: Command '['marker_single', '/home/zzz/文档/PDF/Java并发编程实战.pdf', '/home/sss/ dsadas/pdf-to-markdown/output']' returned non-zero exit status 1.

frankbaele commented 2 months ago

That's not an absurd limit to run into; many PDF services have page or file-size limits for this reason. You can work around it by slicing your PDFs with another PDF library, converting each slice, and then joining the results at the end.
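The slice-and-join workaround could look roughly like the sketch below. It assumes the third-party `pypdf` package as the "other PDF lib" (nothing in this thread names one), and the helper names `page_ranges`, `split_pdf`, and `join_markdown` are hypothetical, not part of marker. The chunk size of 100 pages is an arbitrary starting point, not a tested recommendation.

```python
from pathlib import Path


def page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, stop) page ranges that cover total_pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src: Path, out_dir: Path, chunk_size: int = 100) -> list[Path]:
    """Write src as several smaller PDFs and return their paths in page order."""
    # Assumption: pypdf is installed. Imported lazily so page_ranges()
    # stays usable without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(src))
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    ranges = page_ranges(len(reader.pages), chunk_size)
    for index, (start, stop) in enumerate(ranges):
        writer = PdfWriter()
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        part = out_dir / f"{src.stem}_part{index:03d}.pdf"
        with part.open("wb") as handle:
            writer.write(handle)
        parts.append(part)
    return parts


def join_markdown(markdown_files: list[Path], dest: Path) -> None:
    """Concatenate per-slice Markdown files in the order given."""
    texts = [p.read_text(encoding="utf-8").rstrip() for p in markdown_files]
    dest.write_text("\n\n".join(texts) + "\n", encoding="utf-8")
```

Each file returned by `split_pdf` would then be converted on its own (for example with `marker_single <part.pdf> <output_dir>`, as in the command shown in the error above), and the resulting Markdown files passed to `join_markdown` in the same order. Where marker writes each slice's Markdown is not stated in this thread, so collecting those paths is left to the reader.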

VikParuchuri commented 1 week ago

The CPU RAM issue should be fixed now.