VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0

Memory Leak when Converting Long PDFs to Markdown #205

Open cpa2001 opened 3 months ago

cpa2001 commented 3 months ago

Description:

I’m encountering a significant memory leak when using Marker to convert long PDFs to Markdown. During the conversion process, the memory usage increases substantially, eventually consuming up to 256GB of RAM and 256GB of SWAP space. This issue occurs consistently with larger PDF files and does not resolve until the process is forcibly terminated.

Steps to Reproduce:

1.  Use Marker to convert long PDF documents to Markdown.
•   OCR_ALL_PAGES=True TORCH_DEVICE=cuda marker ./input/folder ./output/folder --workers 32 --min_length 10000
2.  Monitor memory usage during the conversion process (for example, with the monitoring sketch below).
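
For reference, a minimal monitoring sketch, assuming psutil is installed; this script is only an illustration of step 2 and is not part of marker:

import os
import subprocess
import time

import psutil

# The command from this report; paths and options are the ones used above.
cmd = [
    "marker", "./input/folder", "./output/folder",
    "--workers", "32", "--min_length", "10000",
]
env = {**os.environ, "OCR_ALL_PAGES": "True", "TORCH_DEVICE": "cuda"}

proc = subprocess.Popen(cmd, env=env)
root = psutil.Process(proc.pid)

# Log the resident memory of the whole process tree every few seconds.
while proc.poll() is None:
    total_rss = 0
    for p in [root, *root.children(recursive=True)]:
        try:
            total_rss += p.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a worker exited between listing and sampling
    print(f"resident memory: {total_rss / 1024**3:.1f} GiB")
    time.sleep(5)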

Environment:

•   Marker Version: 0.2.14
•   Operating System: Ubuntu 20.04.6 LTS, CUDA 12.3
•   PDF Size: 75.1 MB
•   Command Used: OCR_ALL_PAGES=True TORCH_DEVICE=cuda marker ./input/folder ./output/folder --workers 32 --min_length 10000

[screenshot]

VikParuchuri commented 3 months ago

Are you converting one PDF or multiple? From the worker count, etc., I'm guessing multiple. If multiple, how many PDFs, and how many pages each?

Degfy commented 3 months ago

I had the same problem. Marker used all my memory, so I lost SSH access to the server. 😭

[screenshot]

xbloom commented 3 months ago

I encountered the same issue. With larger files, even a 7 MB or 8 MB PDF can exhaust memory. If possible, could a configuration option be provided to control memory usage?

Degfy commented 2 months ago

I've built a Docker image and run the tool within a container with limited memory, which should work well.
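
A rough sketch of this setup with the Docker SDK for Python; the image tag, mount paths, memory limit, and worker count below are illustrative placeholders, not an official marker image or recommended values:

import docker  # pip install docker

client = docker.from_env()
logs = client.containers.run(
    "marker:latest",  # placeholder tag for a locally built marker image
    command="marker /input /output --workers 4 --min_length 10000",
    volumes={
        "/abs/path/to/input": {"bind": "/input", "mode": "ro"},
        "/abs/path/to/output": {"bind": "/output", "mode": "rw"},
    },
    mem_limit="32g",        # hard cap: the container gets OOM-killed instead of the host
    memswap_limit="32g",    # equal to mem_limit, so no additional swap is allowed
    device_requests=[       # expose GPUs; requires the NVIDIA container toolkit
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
    remove=True,
)
print(logs.decode())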

JarvisUSTC commented 2 months ago

I ran into a similar problem. When I set the number of workers to 2 per GPU on an 8×A100 machine, the machine gets restarted by the cluster management system after a while.

dldx commented 2 months ago

I noticed the same thing with this 200-page PDF. Memory usage hit 75 GB within Google Colab even though VRAM usage stayed low. I ran the most basic command: marker_single AES_FY23_AR.pdf ./ --langs English

[screenshot]

AES_FY23_AR.pdf

Marker did run to completion, so I got good output, but this leak is a bit of a bottleneck. It would be great to know why this is happening!

Thank you @VikParuchuri for these amazing libraries!

zqqian commented 2 months ago

I also encountered this problem, and my server froze as a result. I had to restart the server.

VikParuchuri commented 2 months ago

I'm planning to look into this soon. I'm working on some improved models first, but this is high priority for me.

Marco-Almbauer commented 2 months ago

I am experiencing similar issues; let me elaborate. I am running the code in Google Colab and experimenting with the workers option. When using an A100 GPU, I set more than 10 workers. This worked well yesterday, but today it sometimes overloaded the CPU, causing me to lose the connection and GPU access. After (luckily) reconnecting to an A100, I used the default setting (2 workers) without problems, but resources were not fully utilized. I increased the workers again and got a good processing rate of ~3 PDFs/min for around 10 minutes, but then the CPU started to overload again. I really do not know why, but I suspect it is related to the files I am processing(?). I do not have any logs to show. I would be grateful if someone could share their optimal number of workers for the Google Colab GPU options and their experience. Right now I am running it with an L4 GPU.

[screenshot]

This makes the code a bit unstable for me. I would prefer to use it on a computer cluster, but due to the instability, I do not dare to use up resources.

I hope this comment can help. I also want to thank you for this great package 💯

JY9087 commented 2 months ago

Found the problem. It's surya that causes the memory leak.

In the surya/recognition.py file, within the batch_recognition() function, there's a line:

processed_batches = processor(text=[""] * len(images), images=images, lang=languages)

This line processes all images at once, consuming too much memory.

To address this issue, I modified the code to process a smaller number of images (batch size) at a time, as follows:

for i in tqdm(range(0, len(images), batch_size), desc="Recognizing Text"):
    # Preprocess only the current batch of images instead of all pages at once
    batch_langs = languages[i:i+batch_size]
    has_math = ["_math" in lang for lang in batch_langs]
    batch_images = images[i:i+batch_size]
    processed_batches = processor(text=[""] * len(batch_images), images=batch_images, lang=batch_langs)
    batch_pixel_values = processed_batches["pixel_values"][:batch_size]
    batch_langs = processed_batches["langs"][:batch_size]

By processing fewer images at a time, this approach reduces memory consumption.
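
To make the pattern concrete, here is a standalone sketch of the same idea; the names below are generic placeholders rather than surya's actual API:

from typing import Callable, Iterator, Sequence

def chunked_batches(
    images: Sequence,
    languages: Sequence[str],
    preprocess: Callable,  # stands in for the processor(text=..., images=..., lang=...) call
    batch_size: int,
) -> Iterator[dict]:
    # Preprocess lazily, one chunk at a time, so only the current chunk's
    # tensors are resident in memory.
    for i in range(0, len(images), batch_size):
        batch_images = images[i:i + batch_size]
        batch_langs = languages[i:i + batch_size]
        yield preprocess(
            text=[""] * len(batch_images),
            images=batch_images,
            lang=batch_langs,
        )

# Usage sketch: consume each batch immediately (run recognition, then drop it),
# so earlier batches can be garbage-collected.
# for batch in chunked_batches(images, languages, processor, batch_size=32):
#     run_recognition(batch)  # hypothetical consumer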

VikParuchuri commented 1 month ago

Thanks for finding this! I was looking at this thread to ask if anyone has noticed these issues while running OCR, since that's what some of my testing showed, but you beat me to it :)

I'll work on a fix to release shortly

FireMasterK commented 1 month ago

![image](https://github.com/user-attachments/assets/d10f6897-1a7e-4561-a071-8455f532c6d2)

Profiler Screenshot

I can confirm this was the issue!

FireMasterK commented 1 month ago

It looks like there might be more memory leaks. I made the following changes to recognition.py, but I still get OOM-killed:

    # Initialize processed batches
    processed_batches = {
        "pixel_values": [],
        "langs": [],
    }

    # Preprocess images in chunks; note that every processed batch is still accumulated in memory here
    for i in tqdm(range(0, len(images), batch_size), desc="Preprocessing Images"):
        batch_images = images[i:i+batch_size]
        batch_langs = languages[i:i+batch_size]

        processed_batch = processor(text=[""] * len(batch_images), images=batch_images, lang=batch_langs)

        processed_batches["pixel_values"].extend(processed_batch["pixel_values"])
        processed_batches["langs"].extend(processed_batch["langs"])

![image](https://github.com/user-attachments/assets/12899032-1ca7-4cf2-a42a-8b57799c0e6c)
![image](https://github.com/user-attachments/assets/53ca7215-22e2-4ea9-8dc7-90f27f2206e5)

Profiler Screenshots

VikParuchuri commented 1 month ago

I have a fix that appears to work here - https://github.com/VikParuchuri/surya/commit/04d8a32975022aa059a084053a6f88288b3bbc1f . Note that it is on a branch that I'm still working on, so I won't be merging for a few days.

aprozo commented 1 month ago

@VikParuchuri Hello, thanks a lot for your work! Is there any update on the merge?

VikParuchuri commented 1 month ago

This was merged a few days ago, and marker + surya have been updated.