GustavoDenobi opened this issue 1 year ago
I created a critical bug for this. We will look into it.
I am also having the same problem. The Core will use all the available memory until it crashes the instance.
Just did a complete overhaul of the setup, and the RAM issue still persists.
My current setup is the following:
Now I also have a new issue, which might be worth mentioning, as it could be related:
When I was using Windows/WSL2, I was running the same image (SubCenter-ArcFace-r100-gpu) and could keep the default configuration of 2 processes in the .env file. Now, in Ubuntu, it crashes the Core after a few requests because it's not possible to run 2 uwsgi processes on the GPU: each one uses ~2700MB right after initialization but eventually occupies up to 4016MB of VRAM (which seems to be the limit, as I didn't see it grow past that). I still don't know how to explain this behavior, as I expected it to run even better in Ubuntu, without Windows applications competing for the GPU.
Another update that might be useful: I was able to run with 2 processes, but only if I limit the size of the images I send to the recognition endpoint. In this case, each uwsgi instance consumes 2694MiB of VRAM steadily (checked via nvidia-smi).
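For illustration, a minimal sketch of that workaround: downscale frames before posting them, here using ImageMagick and curl. The endpoint path follows the CompreFace REST API; the host, port and API key are placeholders to adapt to your own deployment.

# shrink anything larger than 640x720 (never enlarges smaller images)
convert frame.jpg -resize '640x720>' frame_small.jpg
# send the downscaled frame to the recognition service
curl -s -X POST "http://localhost:8000/api/v1/recognition/recognize" \
     -H "x-api-key: <recognition-api-key>" \
     -F "file=@frame_small.jpg"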
@evandromoura @GustavoDenobi
Could you pull images again and test again?
docker compose pull
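After pulling, recreate the containers so the new images are actually used (assuming the stock docker-compose setup):

docker compose up -d   # recreate containers from the freshly pulled images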
On 2 March, we released CompreFace with ARM support. To make CompreFace support ARM we had to update some libraries.
I checked - the old version of CompreFace doesn't have such a problem.
So we rolled back this release.
Still, this ticket was created on 27 February, so I want to make sure the problem is still reproducible on your machines; it doesn't reproduce on mine anymore.
Just started testing. Still too soon to say anything about the RAM issue, but I still can't run 2 processes. I'm sending images of 640x720 pixels. Hitting the recognition endpoint with frames without faces, the uwsgi VRAM usage goes to something like 2240MiB. The first time I send an image with a face, one of the instances goes to 3084MiB. The second time makes it crash (it seems obvious to me that it fills the VRAM), and then the instance that couldn't find enough memory starts triggering the following message:
"Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces] [FacesFeignClient#findFaces(MultipartFile,Integer,Double,String)]: [{"message":"PluginError: insightface.Calculator@arcface-r100-msfdrop75 error - simple_bind error. Arguments:\ndata: (1, 3, 112, 112)\nTraceback (most recent call last):\n File \"../src/storage/./pooled_storage_manager.h\", line 160\nMXNetError: cudaMalloc retry failed: out of memory"}]"
I'd be OK with not being able to run the container on my current GPU due to lack of memory, but given that I was able to run it steadily for days just 2 weeks ago, sending even bigger images and with only the RAM build-up problem, I believe the problem lies somewhere else.
Just checked that the insightface version used in Core is quite old (0.1.5, while the latest is 0.7.2). I tried to update it, but it broke some other dependencies.
I'm having a similar issue running the following: i7-1165G7, 16GB RAM, 500GB NVMe, RTX 2060 6GB, Ubuntu 22.04, Docker 23.0.1, Driver Version 525.85.05, CUDA Version 12.0.
Running the SubCenter-ArcFace-R100-gpu build maxes out the GPU RAM very quickly on my box; even uploading 2-3 images is enough to crash the core container. Unless I kill the container, the GPU RAM never subsides. I have done a docker compose pull and the issue remains.
I'm sorry for not getting back to you sooner. I did a lot of tests, and here are the results:
In your .env file, replace CORE_VERSION=1.1.0-arcface-r100-gpu with CORE_VERSION=1.0.0-arcface-r100-gpu. The 1.0.0 version of compreface-core should work fine with the 1.1.0 versions of the other CompreFace containers. The only limitation is that you won't be able to use the 'face pose' plugin.
@arakasi55, in your .env file, replace uwsgi_processes=2 with uwsgi_processes=1. Your GPU can't run two processes with such a big neural network. You can continue using the 1.1.0 version; in my tests, it didn't use more than 4.2GB per process. We don't plan to try to optimize the mxnet version: it looks like mxnet is dead, so we need to migrate to another library, and it makes more sense to spend the time optimizing the new one.
Thanks for the testing and suggestion, will update the env file and report back.
So far so good after changing the max processes to 1. GPU RAM sits consistently at 4.072GiB. Thanks again.
What are the implications of setting uwsgi_processes=1? Does it result in poorer performance? Is it only able to process half as many images (in a given period of time) as it would with 2 processes?
In short - yes, you can expect this. But in reality, it's not so simple. If one process can load your GPU 100%, then adding a new process won't increase performance. Also, there may be other bottlenecks in the system.
You can benchmark it on your system with the 1.0.0-arcface-r100-gpu version and see the difference.
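A rough way to compare the two settings is to time a fixed batch of concurrent requests against the recognition endpoint under each configuration. A minimal sketch with curl and xargs, assuming the default port 8000 and a sample image on disk (endpoint, port and API key are placeholders for your own deployment):

# fire 100 recognition requests, up to 4 in flight at once, and time the batch
time seq 1 100 | xargs -P 4 -I{} curl -s -o /dev/null -X POST \
  "http://localhost:8000/api/v1/recognition/recognize" \
  -H "x-api-key: <recognition-api-key>" \
  -F "file=@test_face.jpg"

Run it once with uwsgi_processes=1 and once with uwsgi_processes=2 (on a GPU that can hold two processes) and compare the wall-clock times.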
I have the same problem with the mobile-net GPU build. I am using Docker Desktop on Windows 10 with WSL. I did some testing with the 1.1.0 core and 3 uwsgi processes:
JPEGs of 2592x1944 (with and without faces) were sent 4 times per second from 3 sources. This didn't leak much (about 200 MB in 30 minutes).
2 JPEG sources at 2592x1944 and 1 source at 1920x1080 were sent to CompreFace 4 times per second. This triggered a much more aggressive memory leak: about 100 MB per minute.
The amount of GPU memory stayed stable in both tests.
Core 1.0.0 does not have these problems and consumes less RAM/GPU memory. If I use it, will I lose detection and recognition quality compared to 1.1.0?
Do I understand correctly:
1. You have RAM leaks, not GPU memory leaks?
2. There are no such leaks in 1.0.0?
Answering your question: yes, you can use the 1.0.0 version. It has the same detection and recognition quality.
I'm sorry for not getting back to you sooner. We investigated the memory leaks and found several causes. We fixed almost all of them. One case we didn't fix for now is sending images with different resolutions to MobileNet or ArcFace, but it looks like that's not your case. The fixes will go out with the next release. I hope your case will be fixed.
Now that the exadel/compreface-core:1.2.0-arcface-r100-gpu version has been released, has the OOM problem been fixed?
This seems related to the OP, except the leak appears to be much slower.
I didn't find out why memory consumption is increasing. But it is reproducible in all CompreFace versions, and there is a limit.
I read this as saying there is a limit to how much RAM CompreFace will consume, so I figured I'd allocate 64GB of RAM to see what the limit is. The container has been running for 11 days and CompreFace has commandeered approximately 29GB of RAM. On my setup it seems to be taking ~2.5GB of RAM per day, so I expect the container to crash around day 24.
I have two faces: one with 27 images, the other with 48 images. I'm using the API via Frigate and Double Take.
CORE_VERSION 1.2.0-arcface-r100-gpu, Proxmox VE 8.1.0 (kernel 6.5.11-7-pve), PVE-Manager 8.1.3, Debian 12 LXC, Docker 24.0.7, Driver Version 535.146.02, CUDA Version 12.2, 2x Xeon E5-2660 v4, RTX 4060 8GB, Quadro P2000 5GB.
Edit - About 19 days in and CompreFace has continued to consume RAM at ~2.5GB per day. Unsure why, as the CompreFace new releases page directly mentions "Performance optimization and memory leak fixes" for version 1.2.
Thanks to the last release, I no longer need to restart the core every 3 hours. Thanks a lot! But the remaining memory issue still impacts my use case. To optimize resource usage, I perform motion detection and crop only the regions of interest, but because of the memory leak that happens when sending images with different sizes, I can't use this strategy.
Describe the bug
Core occupies memory until eventually taking all memory available.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Continuous availability.
Desktop (please complete the following information):
Additional context
Core occupies memory until eventually taking all memory available. Tried increasing available memory, but even 23GB (set via .wslconfig) wasn't enough.
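For reference, the WSL2 memory cap is set in the .wslconfig file in the Windows user profile; the relevant section looks like this (the 23GB value is just the one from my tests):

[wsl2]
memory=23GB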
I'm running the ArcFace GPU custom build in Ubuntu 22.04 (WSL2 in Windows 10), but the same behavior happened in Ubuntu 20.04 (also in WSL2). Even tried using Docker Desktop, but the issue persists. I have less than 1K faces in the database now.
Tried reducing the memory allocated to Java (api and admin) to as low as 1GB each, but the result was the same.
My temporary solution is to restart the container every 3 hours, but I can't keep doing it forever.
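For completeness, a sketch of that stopgap as a cron entry on the Docker host, assuming the core container is named compreface-core as in the stock docker-compose (adjust the name if yours differs):

# restart the core container every 3 hours
0 */3 * * * docker restart compreface-core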