exadel-inc / CompreFace

Leading free and open-source face recognition system
https://exadel.com/accelerator-showcase/compreface/
Apache License 2.0
5.7k stars 775 forks

[BUG] Nvidia in Docker: MobilenetGPU/ - uWSGI died - How to debug? #652

Open ozett opened 3 years ago

ozett commented 3 years ago

Describe the bug

Detection hangs because processes are being killed. The log scrolls by with backtraces that I cannot read.

To Reproduce

Steps to reproduce the behavior:

1) Start docker-compose.
2) Go to the web GUI.
3) Test face recognition; it does not run.

Expected behavior

CompreFace GPU models run with compreface-core in Docker without errors.

Screenshots

image

Desktop (please complete the following information):

root@ub20-frigate4:/usr/src# nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1660 (UUID: GPU-65d82c7a-fb69-3e25-a081-2baef57fba23)


Additional context

.env:

root@ub20-frigate4:/usr/src# cat .env
registry=exadel/
postgres_username=postgres
postgres_password=postgres
postgres_db=frs
postgres_domain=compreface-postgres-db
postgres_port=5432
email_host=smtp.gmail.com
email_username=
email_from=
email_password=
enable_email_server=false
save_images_to_db=true
compreface_api_java_options=-Xmx8g
compreface_admin_java_options=-Xmx8g
ADMIN_VERSION=0.6.1
API_VERSION=0.6.1
FE_VERSION=0.6.1
CORE_VERSION=0.6.1-mobilenet-gpu


docker-compose.yml:

root@ub20-frigate4:/usr/src# cat dc-cface.yml
version: '3.4'

volumes:
  postgres-data:

services:
  compreface-postgres-db:
    image: postgres:11.5
    container_name: "compreface-postgres-db"
    environment:

I don't know what else is needed or how to get more debug output, but I will do what I can to help.

compreface-core           | [17:37:48] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice
compreface-core           | !!! uWSGI process 78 got Segmentation Fault !!!
compreface-core           | *** backtrace of 78 ***
compreface-core           | uwsgi(uwsgi_backtrace+0x2a) [0x55e0d380c33a]
compreface-core           | uwsgi(uwsgi_segfault+0x23) [0x55e0d380c723]
compreface-core           | /lib/x86_64-linux-gnu/libc.so.6(+0x3f040) [0x7f6c3c736040]
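
For anyone trying to read a log like the one above, a small script can at least pull out which uWSGI worker crashed and which frames were on the stack. The following is a minimal sketch in Python; the regular expressions are assumptions based only on the excerpt above, and the function name `find_segfaults` is hypothetical.

```python
import re

# Patterns derived from the log excerpt in this issue only; they are an
# assumption, not a documented uWSGI log format.
SEGFAULT_RE = re.compile(r"!!! uWSGI process (\d+) got Segmentation Fault !!!")
FRAME_RE = re.compile(r"\|\s*(\S+?)\((.*?)\)\s*\[(0x[0-9a-f]+)\]")


def find_segfaults(log_text):
    """Return one dict per crash: {'pid': int, 'frames': [(binary, symbol, address)]}."""
    crashes = []
    current = None
    for line in log_text.splitlines():
        crash = SEGFAULT_RE.search(line)
        if crash:
            # A new crash report starts; following frame lines belong to it.
            current = {"pid": int(crash.group(1)), "frames": []}
            crashes.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append(frame.groups())
    return crashes


SAMPLE = """\
compreface-core           | [17:37:48] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice
compreface-core           | !!! uWSGI process 78 got Segmentation Fault !!!
compreface-core           | *** backtrace of 78 ***
compreface-core           | uwsgi(uwsgi_backtrace+0x2a) [0x55e0d380c33a]
compreface-core           | uwsgi(uwsgi_segfault+0x23) [0x55e0d380c723]
compreface-core           | /lib/x86_64-linux-gnu/libc.so.6(+0x3f040) [0x7f6c3c736040]
"""

for crash in find_segfaults(SAMPLE):
    print("worker", crash["pid"], "crashed;", len(crash["frames"]), "frames captured")
```

In this excerpt the visible frames are uWSGI's own signal handler and libc, so the frame that actually faulted is not shown here.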

Testing NVIDIA-Docker was successful:

image

pospielov commented 3 years ago

Hi, to be honest, I have no idea :( I hate CUDA for this: it may just not work, and you don't know why. Here is what I know. We use CUDA 10.0, probably because that was the version that worked; as you can see, sometimes it just does not work, so we took the version that does. This version is installed inside the container, so it doesn't matter which CUDA version is on your machine. What matters is the NVIDIA driver. My current driver is 470.74, but it worked on 460 as well. My only guess is that another application is holding the GPU, so CompreFace doesn't have access to it. As far as I can see, your nvidia-smi screenshots were taken inside the container. What about the host machine?
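
Since the comment above names the host driver version as the thing that matters, a quick check on the host could look like the sketch below. It relies on nvidia-smi's `--query-gpu=driver_version` and `--format=csv,noheader` options. The threshold of 460 comes only from the remark that "it worked on 460 as well" and is not an official minimum; the helper names are hypothetical.

```python
import subprocess

# 460 is the oldest driver reported as working in this thread; treat it as an
# observation, not a documented requirement.
KNOWN_WORKING_MAJOR = 460


def driver_major(version_text):
    """Parse '470.74' or '470.74.01' (first line of output) into 470."""
    first_line = version_text.strip().splitlines()[0]
    return int(first_line.split(".")[0])


def is_known_working(version_text, minimum=KNOWN_WORKING_MAJOR):
    """True if the driver major version is at least the one reported as working."""
    return driver_major(version_text) >= minimum


def query_host_driver():
    """Run nvidia-smi on the host and return its driver version output."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a host with the NVIDIA driver installed, `is_known_working(query_host_driver())` gives a rough yes/no answer. A `True` result does not guarantee that the GPU image will run.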

ozett commented 3 years ago

hi, thanks for looking into this..

nvidia-smi was run on the host. Inside a container, nvidia-smi does not show information about GPU tasks. nvidia is a small company.

The whole nvidia-container setup runs fine when I use it with the frigate-nvidia container on the same machine.

It must be something inside the CompreFace GPU container:

CORE_VERSION=0.6.1-mobilenet-gpu or CORE_VERSION=0.6.1-arcface-r100-gpu

any hint how to track this down inside your container?

pospielov commented 3 years ago

I mean, here are the results of nvidia-smi when I run it on the host (not inside the container):

image

As you can see, even without CompreFace there are several applications that use the GPU. I wanted to see which applications use your GPU in your host.
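
To answer the question of what is using the GPU on the host without a screenshot, the process list can also be collected as text. The sketch below parses the output of `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader`. Note that this query lists compute processes only, so graphics clients that appear in the plain `nvidia-smi` table may be missing. The function names are hypothetical.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' lines into a list of dicts."""
    apps = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split the pid off the front and the memory off the back so that a
        # process name containing a comma stays intact.
        pid, rest = line.split(",", 1)
        name, memory = rest.rsplit(",", 1)
        apps.append(
            {"pid": int(pid.strip()), "name": name.strip(), "used_memory": memory.strip()}
        )
    return apps


def query_compute_apps():
    """Run nvidia-smi on the host and return the parsed compute process list."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)
```

An empty list from `query_compute_apps()` on the host would support the claim that no other compute workload is holding the GPU.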

any hint how to track this down inside your container?

I don't see how that is possible. The problem is not that it runs in a container; the problem is that the error is in compiled .so code. If it were Python, you could debug it, but you can't debug compiled code.

ozett commented 3 years ago

I wanted to see which applications use your GPU in your host

There is nothing else on this server besides the CompreFace containers, so nothing else is running on the GPU apart from what is configured inside the compreface-core container for the NVIDIA runtime.

I will try to capture this and post it here in a follow-up.

ozett commented 3 years ago

Success with GPU. I started from the beginning:

1) Re-installed the NVIDIA driver on the Ubuntu host with nvidia.run.
2) Removed all leftover containers: docker system prune --all (https://docs.docker.com/config/pruning/#prune-everything)
3) Changed the way I install CompreFace:

SUCCESS. It's up and running with the NVIDIA GPU in the container.

What I changed: before, I only edited the .env file, always in the same source directory. That seems to lead to errors.
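
The thread does not show the new install steps, so the following is only a sketch of one way to avoid reusing the same directory: create a fresh directory per core variant and write the `.env` and compose file into it. The directory naming scheme and both function names are assumptions made for illustration.

```python
import re
from pathlib import Path


def set_core_version(env_text, core_version):
    """Return env_text with its CORE_VERSION line replaced, or appended if absent."""
    line = f"CORE_VERSION={core_version}"
    pattern = r"^CORE_VERSION=.*$"
    if re.search(pattern, env_text, flags=re.MULTILINE):
        # A callable replacement avoids any escape handling in the version string.
        return re.sub(pattern, lambda _: line, env_text, flags=re.MULTILINE)
    return env_text.rstrip("\n") + "\n" + line + "\n"


def make_fresh_install_dir(base_dir, core_version, env_text, compose_text):
    """Write .env and docker-compose.yml into a new directory named after the variant."""
    target = Path(base_dir) / f"compreface-{core_version}"
    # exist_ok=False makes this fail rather than silently reuse an old directory.
    target.mkdir(parents=True, exist_ok=False)
    (target / ".env").write_text(set_core_version(env_text, core_version))
    (target / "docker-compose.yml").write_text(compose_text)
    return target
```

Docker Compose derives its default project name from the directory name, so a new directory also starts with its own named volumes. Whether that was the cause of the original errors is not established in this thread.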

image