JaidedAI / EasyOCR

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.
https://www.jaided.ai
Apache License 2.0

Fix: Resolves memory leak caused by using CRAFT detector with detect() or readtext(). #1278

Open daniellovera opened 2 months ago

daniellovera commented 2 months ago

This fix enables garbage collection to work as intended when https://github.com/JaidedAI/EasyOCR/blob/c999505ef6b43be1c4ee36aa04ad979175178352/easyocr/detection.py#L24 returns, by deleting the objects we moved to the GPU after moving the forward-pass results back to the CPU.

See https://pytorch.org/blog/understanding-gpu-memory-2/#why-doesnt-automatic-garbage-collection-work for more detail.

Running torch.cuda.empty_cache() in test_net() before returning lets nvidia-smi report accurate memory usage.
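In sketch form, the patched path inside test_net() looks roughly like this (a simplified illustration, not the exact diff; x, y, and feature are the tensors moved to the GPU, and test_net_sketch is just an illustrative name):

```python
import torch

def test_net_sketch(net, x, device):
    x = x.to(device)                          # input batch moved to the GPU
    with torch.no_grad():
        y, feature = net(x)                   # CRAFT forward pass on the GPU
    # Copy the results we need back to the CPU first...
    score_text = y[0, :, :, 0].cpu().numpy()
    score_link = y[0, :, :, 1].cpu().numpy()
    # ...then drop the last in-scope references to the GPU tensors so the
    # caching allocator can reuse their memory as soon as we return.
    del x, y, feature
    if device == 'cuda':                      # per the follow-up: CUDA only
        torch.cuda.empty_cache()              # so nvidia-smi reflects real usage
    return score_text, score_link
```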

Interestingly, nvidia-smi showed that GPU memory usage per process was 204 MiB upon reader initialization, would increase to 234 MiB or 288 MiB after running easyocr.reader.detect(), but then would not increase beyond that point and in some cases dropped back down to 234 MiB. I suspect this has something to do with PyTorch's caching allocator reusing blocks rather than allocating new ones.

One note: I tested this on a single-GPU machine, where I changed https://github.com/JaidedAI/EasyOCR/blob/c999505ef6b43be1c4ee36aa04ad979175178352/easyocr/detection.py#L86 to net = net.to(device), removing DataParallel. There's no reason this shouldn't work on multi-GPU machines, but note that it wasn't tested on one.
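Concretely, the change tested here was (a sketch; per my reading, the original line wraps the model in torch.nn.DataParallel):

```python
# easyocr/detection.py, around line 86
# net = torch.nn.DataParallel(net).to(device)   # original (multi-GPU capable)
net = net.to(device)                             # single-GPU variant used for testing
```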

I also only tested this on the CRAFT detector, not DBNet.

Relevant package versions:

- easyocr 1.7.1
- torch 2.2.1+cu121
- torchvision 0.17.1+cu121

Hope this helps!

daniellovera commented 2 months ago

I should clarify: this resolves the GPU VRAM memory leak. It does not resolve the CPU RAM memory leaks.

daniellovera commented 2 months ago

Corrected to only call empty_cache() if the device in use is cuda.

jonashaag commented 1 month ago

The del stuff can't possibly work. It just removes the Python variable from the scope (the function) but doesn't actually remove anything from the GPU/CPU.

daniellovera commented 1 month ago

> The del stuff can't possibly work. It just removes the Python variable from the scope (the function) but doesn't actually remove anything from the GPU/CPU.

@jonashaag did you attempt to replicate my results? It'll take you less than 15 minutes to give it a whirl and see for yourself whether it works.

Because it did work for me, and the pytorch.org blog post I linked explains exactly why it works. I'll quote it here:

> Why doesn’t automatic garbage collection work? The automatic garbage collection works well when there is a lot of extra memory as is common on CPUs because it amortizes the expensive garbage collection by using Generational Garbage Collection. But to amortize the collection work, it defers some memory cleanup making the maximum memory usage higher, which is less suited to memory constrained environments. The Python runtime also has no insights into CUDA memory usage, so it cannot be triggered on high memory pressure either. It’s even more challenging as GPU training is almost always memory constrained because we will often raise the batch size to use any additional free memory.
>
> The CPython’s garbage collection frees unreachable objects held in reference cycles via the mark-and-sweep. The garbage collection is automatically run when the number of objects exceeds certain thresholds. There are 3 generations of thresholds to help amortize the expensive costs of running garbage collection on every object. The later generations are less frequently run. This would explain why automatic collections will only clear several tensors on each peak, however there are still tensors that leak resulting in the CUDA OOM. Those tensors were held by reference cycles in later generations.
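To make that concrete, here's a minimal standalone illustration of a reference cycle keeping a CUDA tensor alive until the collector runs (not EasyOCR code; assumes a CUDA device is available):

```python
import gc
import torch

def leaky():
    t = torch.ones(1024, 1024, device='cuda')   # ~4 MiB on the GPU
    holder = {'tensor': t}
    holder['self'] = holder                      # reference cycle: dict -> itself
    # returning does NOT free t: its refcount never reaches zero

leaky()
print(torch.cuda.memory_allocated())   # nonzero: the cycle keeps t alive
gc.collect()                           # mark-and-sweep breaks the cycle
torch.cuda.empty_cache()               # hand the freed blocks back to the driver
print(torch.cuda.memory_allocated())   # 0
```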

I'm not going to claim that I think it SHOULD work this way. But this isn't the first time that weird garbage-collection and scoping behavior across the CPU/GPU boundary has caused problems.

Again, try it and let us all know if it's actually working for you or not.

jonashaag commented 1 month ago

Sorry, maybe I misunderstood the reason why del is used here. Is it so that the call to empty_cache() can remove the tensors x, y, feature from GPU memory? That might work unless there are other references to the tensors that those variables reference.
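For example (standalone illustration, not this PR's code):

```python
import torch

x = torch.ones(1024, 1024, device='cuda')  # ~4 MiB on the GPU
alias = x                                   # a second reference to the same tensor
del x                                       # removes the name, not the tensor
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())        # still ~4 MiB: `alias` keeps it alive
del alias                                   # last reference gone
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())        # 0
```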

daniellovera commented 1 month ago

> Sorry, maybe I misunderstood the reason why del is used here. Is it so that the call to empty_cache() can remove the tensors x, y, feature from GPU memory? That might work unless there are other references to the tensors that those variables reference.

I don't think I understand it well enough to explain it better. I also call torch.cuda.empty_cache() and torch.cuda.reset_peak_memory_stats() after the function returns. It's possible that the empty_cache() call inside the function isn't actually doing anything, since the GC doesn't run until the function goes out of scope. I probably should have double-checked that, but I was less concerned with nvidia-smi being accurate than with not getting CUDA OOM errors.
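For anyone wanting to reproduce the measurement, the loop is roughly this (a hypothetical sketch; the image paths are placeholders, and Reader/detect follow EasyOCR's public API):

```python
import easyocr
import torch

reader = easyocr.Reader(['en'], gpu=True)

for path in ['img1.png', 'img2.png']:               # placeholder image paths
    reader.detect(path)
    print(path, torch.cuda.max_memory_allocated())  # peak for this call
    torch.cuda.empty_cache()                        # so nvidia-smi shows real usage
    torch.cuda.reset_peak_memory_stats()            # fresh peak for the next call
```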

I'm far from an expert, but I do know that these changes resulted in halting the memory leaks I had, and I haven't had a CUDA OOM error since.

Best suggestion, since action produces information: give it a whirl and let us all know if it works. If it doesn't work for you, then it's valuable for me to know how your machine differs from mine, so I can make further changes to avoid these errors if I scale up or swap machines.

daniellovera commented 1 month ago

@jonashaag Hey, if you tried it, I'd love to know whether del worked for you.

jonashaag commented 1 month ago

Sorry, I've switched to another engine (macOS Live Text) because it's better and much faster.

I feel a bit bad to have left such a smart-ass comment initially and not contribute anything of substance here :-/

daniellovera commented 1 month ago

It's all good. Are you using Live Text natively on the devices, or can it be hosted in a way that lets it replace EasyOCR for serving a website that isn't running on an Apple device?

jonashaag commented 1 month ago

Yes, we run a Mac mini in production (via Scaleway).

If you're interested, I can share some code.

BMukhtar commented 2 weeks ago

Thanks! I was able to reproduce the leak, and your fix works. It took me a while to figure this issue out. Can we merge this PR ASAP and bump the EasyOCR version? (For now I've just applied the fix locally.)