GustavoDenobi opened this issue 1 year ago
I created a critical bug for this. We will look into it.
I am also having the same problem. The Core will use all the available memory until it crashes the instance.
Just did a complete overhaul of the setup, and the RAM issue still persists.
My current setup is the following:
Now I also have a new issue, which might be worth mentioning, as it could be related:
When I was using Windows/WSL2, I was running the same image (SubCenter-ArcFace-r100-gpu) and could keep the default configuration of 2 processes in the .env file. Now, in Ubuntu, it crashes the Core after a few requests because it's not possible to run 2 uwsgi processes on the GPU: each one uses ~2700MB right after initialization but eventually occupies up to 4016MB of VRAM (which seems to be the limit, as I didn't see it grow past that). I still don't know how to explain this behavior, as I expected it to run even better in Ubuntu, without Windows applications competing for the GPU.
Another update that might be useful: I was able to run with 2 processes, but only if I limit the size of the images I send to the recognition endpoint. In this case, each uwsgi instance consumes 2694MiB of VRAM steadily (checked via nvidia-smi).
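For illustration, a minimal sketch of that workaround: downscale frames before posting them, here using ImageMagick and curl. The endpoint path follows the CompreFace REST API; the host, port and API key are placeholders to adapt to your own deployment.

# shrink anything larger than 640x720 (never enlarges smaller images)
convert frame.jpg -resize '640x720>' frame_small.jpg
# send the downscaled frame to the recognition service
curl -s -X POST "http://localhost:8000/api/v1/recognition/recognize" \
     -H "x-api-key: <recognition-api-key>" \
     -F "file=@frame_small.jpg"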
@evandromoura @GustavoDenobi
Could you pull images again and test again?
docker compose pull
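After pulling, recreate the containers so the new images are actually used (assuming the stock docker-compose setup):

docker compose up -d   # recreate containers from the freshly pulled images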
On 2 March, we released CompreFace with ARM support. To make CompreFace support ARM we had to update some libraries.
I checked - the old version of CompreFace doesn't have such a problem.
So we rolled back this release.
Still, this ticket was created on 27 February, so I want to make sure the problem is still reproducible on your machines; it doesn't reproduce on mine anymore.
Just started testing. Still too soon to say anything about the RAM issue, but I still can't run 2 processes. I'm sending images of 640x720 pixels. Hitting the recognition endpoint with frames without faces, the uwsgi VRAM usage goes to something like 2240MiB. The first time I send an image with a face, one of the instances goes to 3084MiB. The second time makes it crash (it seems obvious to me that it fills the VRAM), and then the instance that couldn't find enough memory starts triggering the following message:
"Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces] [FacesFeignClient#findFaces(MultipartFile,Integer,Double,String)]: [{"message":"PluginError: insightface.Calculator@arcface-r100-msfdrop75 error - simple_bind error. Arguments:\ndata: (1, 3, 112, 112)\nTraceback (most recent call last):\n File \"../src/storage/./pooled_storage_manager.h\", line 160\nMXNetError: cudaMalloc retry failed: out of memory"}]"
I'd be OK with not being able to run the container on my current GPU due to lack of memory, but given that I was able to run it steadily for days just 2 weeks ago, sending even bigger images and with only the RAM build-up problem, I believe the problem lies somewhere else.
Just checked that the insightface version used in Core is quite old (0.1.5, while the latest is 0.7.2). I tried to update it, but it broke some other dependencies.
I'm having a similar issue running the following: i7-1165G7, 16GB RAM, 500GB NVMe, RTX 2060 6GB, Ubuntu 22.04, Docker 23.0.1, Driver Version 525.85.05, CUDA Version 12.0.
Running the SubCenter-ArcFace-R100-gpu build maxes out the GPU RAM very quickly on my box; even uploading 2-3 images is enough to crash the core container. Unless I kill the container, the GPU RAM never subsides. I have done a docker compose pull and the issue remains.
I'm sorry for not getting back to you sooner. I did a lot of tests, and here are the results:
In your .env file, replace CORE_VERSION=1.1.0-arcface-r100-gpu with CORE_VERSION=1.0.0-arcface-r100-gpu. The 1.0.0 version of compreface-core should work fine with the 1.1.0 versions of the other CompreFace containers. The only limitation is that you won't be able to use the 'face pose' plugin.
@arakasi55, in your .env file, replace uwsgi_processes=2 with uwsgi_processes=1. Your GPU can't run two processes with such a big neural network. You can continue using the 1.1.0 version; in my tests, it didn't use more than 4.2GB per process. We don't plan to try to optimize the mxnet version: it looks like mxnet is dead, so we need to migrate to another library, and it makes more sense to spend the time optimizing the new one.
Thanks for the testing and suggestion, will update the env file and report back.
So far so good after changing the max processes to 1. GPU RAM sits consistently at 4.072GiB. Thanks again.
What are the implications of setting uwsgi_processes=1? Does it result in poorer performance? Is it only able to process half as many images (in a given period of time) as it would with 2 processes?
In short - yes, you can expect this. But in reality, it's not so simple. If one process can load your GPU 100%, then adding a new process won't increase performance. Also, there may be other bottlenecks in the system.
You can benchmark it on your system with the 1.0.0-arcface-r100-gpu version and see the difference.
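A rough way to compare the two settings is to time a fixed batch of concurrent requests against the recognition endpoint under each configuration. A minimal sketch with curl and xargs, assuming the default port 8000 and a sample image on disk (endpoint, port and API key are placeholders for your own deployment):

# fire 100 recognition requests, up to 4 in flight at once, and time the batch
time seq 1 100 | xargs -P 4 -I{} curl -s -o /dev/null -X POST \
  "http://localhost:8000/api/v1/recognition/recognize" \
  -H "x-api-key: <recognition-api-key>" \
  -F "file=@test_face.jpg"

Run it once with uwsgi_processes=1 and once with uwsgi_processes=2 (on a GPU that can hold two processes) and compare the wall-clock times.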
I have the same problem with the mobile-net GPU build. I am using Docker Desktop on Windows 10 with WSL. I did some testing with the 1.1.0 core and 3 uwsgi processes:
JPEGs of 2592x1944 (with and without faces) were sent 4 times per second from 3 sources. This didn't leak much (about 200 MB in 30 minutes).
2 JPEG sources at 2592x1944 and 1 source at 1920x1080 were sent to CompreFace 4 times per second. This triggered a much more aggressive memory leak: about 100 MB per minute.
The amount of GPU memory stayed stable in both tests.
Core 1.0.0 does not have these problems and consumes less RAM/GPU memory. If I use it, will I lose detection and recognition quality compared to 1.1.0?
Do I understand correctly:
1. You have RAM leaks, not GPU memory leaks?
2. There are no such leaks in 1.0.0?
Answering your question: yes, you can use the 1.0.0 version. It has the same detection and recognition quality.
I'm sorry for not getting back to you sooner. We investigated the memory leaks and found several causes. We fixed almost all of them. One case we didn't fix for now is sending images with different resolutions to MobileNet or ArcFace, but it looks like that's not your case. The fixes will go out with the next release. I hope your case will be fixed.
Now that the exadel/compreface-core:1.2.0-arcface-r100-gpu version has been released, has the OOM problem been fixed?
This seems related to the OP, except the leak appears to be much slower.
I didn't find out why memory consumption is increasing. But it is reproducible in all CompreFace versions, and there is a limit.
I read this as saying there is a limit to how much RAM CompreFace will consume, so I figured I'd allocate 64GB of RAM to see what the limit is. The container has been running for 11 days and CompreFace has commandeered approximately 29GB of RAM. On my setup it seems to be taking ~2.5GB of RAM per day, so I expect the container to crash around day 24.
I have two faces: one with 27 images, the other with 48 images. I'm using the API via Frigate and Double Take.
CORE_VERSION 1.2.0-arcface-r100-gpu, Proxmox VE 8.1.0 (kernel 6.5.11-7-pve), PVE-Manager 8.1.3, Debian 12 LXC, Docker 24.0.7, Driver Version 535.146.02, CUDA Version 12.2, 2x Xeon E5-2660 v4, RTX 4060 8GB, Quadro P2000 5GB.
Edit - About 19 days in and CompreFace has continued to consume RAM at ~2.5GB per day. Unsure why, as the CompreFace new releases page directly mentions "Performance optimization and memory leak fixes" for version 1.2.
Thanks to the last release, I no longer need to restart the core every 3 hours. Thanks a lot! But the remaining memory issue still impacts my use case. To optimize resource usage, I perform motion detection and crop only the regions of interest, but because of the memory leak that happens when sending images with different sizes, I can't use this strategy.
Describe the bug
Core occupies memory until eventually taking all memory available.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Continuous availability.
Desktop (please complete the following information):
Additional context
Core occupies memory until eventually taking all memory available. Tried increasing available memory, but even 23GB (set via .wslconfig) wasn't enough.
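For reference, the WSL2 memory cap is set in the .wslconfig file in the Windows user profile; the relevant section looks like this (the 23GB value is just the one from my tests):

[wsl2]
memory=23GB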
I'm running the ArcFace GPU custom build in Ubuntu 22.04 (WSL2 in Windows 10), but the same behavior happened in Ubuntu 20.04 (also in WSL2). Even tried using Docker Desktop, but the issue persists. I have less than 1K faces in the database now.
Tried reducing the memory allocated to Java (api and admin) to as low as 1GB each, but the result was the same.
My temporary solution is to restart the container every 3 hours, but I can't keep doing it forever.
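For completeness, a sketch of that stopgap as a cron entry on the Docker host, assuming the core container is named compreface-core as in the stock docker-compose (adjust the name if yours differs):

# restart the core container every 3 hours
0 */3 * * * docker restart compreface-core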