exadel-inc / CompreFace

Leading free and open-source face recognition system
https://exadel.com/accelerator-showcase/compreface/
Apache License 2.0

Using Recognition Service with more than 20k images registered #621

Closed · AnkurChatter closed this issue 2 years ago

AnkurChatter commented 2 years ago

Hi, I tested the application and it works amazingly well. Thank you so much for the build and the documentation. But when I upload more than 1k images and test the recognition service on them, the API container fails.

After that I looked into custom builds and tried MobileNet/FaceNet/InsightFace, all of them without GPU support, but after uploading 3k images the container again failed during a call to the recognition service for just 1 patient. My current laptop configuration is: CPU - Intel Core i3-1115G4 with 4.1 GHz clock speed (dual core); OS - Windows 10 64-bit; RAM - 8 GB DDR4 2666 MHz; hard disk - 1 TB HDD; graphics card - Intel UHD Graphics 630, 1666 MB.

The current laptop doesn't have Nvidia graphics, so no additional GPU support. This will act as a standalone system which should be capable of running the face recognition algorithm up to a scale of 20k images. I cannot host this on the cloud; it needs to be on-premises.

Can you please suggest a custom build, and if such a build is not possible, what kind of system configuration would be needed to scale up to that level?

Update: I will try running the Docker containers in a Kubernetes cluster and see if that helps solve the scaling issue.

Thank you.

pospielov commented 2 years ago

Hi, first some statistics: we have a server with 50k saved images and 12 GB of RAM, and it works fine.

The containers probably fail because of an OOM error: they try to use more memory than you have. But I think 8 GB may be enough; all you need to do is limit the memory the containers can use. To do this, change your .env file: compreface_api_java_options=-Xmx3g and compreface_admin_java_options=-Xmx1g.
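For example (a minimal sketch; in the default CompreFace setup the .env file sits next to docker-compose.yml):

```
# .env - cap the Java heap so all containers fit into the available RAM
compreface_api_java_options=-Xmx3g
compreface_admin_java_options=-Xmx1g
```

After changing the file, run docker-compose up -d again so the new values are applied.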

pospielov commented 2 years ago

BTW if you want to run it in Kubernetes, check out this repository: https://github.com/exadel-inc/compreface-kubernetes

AnkurChatter commented 2 years ago

Hi @pospielov,

Thank you so much for the quick reply. Yes, I was checking the Kubernetes repository, and the stats you have given help a lot. As you said, the application was crashing due to an OOM error. I will also try the suggestions you mentioned and get back to you.

Again, thank you so much for the quick turnaround.

AnkurChatter commented 2 years ago

Hi @pospielov,

I tried the suggestion (compreface_api_java_options=-Xmx3g, compreface_admin_java_options=-Xmx1g), but now the container crashes even with 1k images when I try to run the recognition service. Can you please give me your server specifications?

My current laptop is not very high-end, and I have an Intel graphics card (Intel UHD Graphics 630, 1536 MB), not Nvidia. I saw that some .env files have the runtime key set to nvidia.

Can I set the runtime to Intel? If so, can you please help me or point me to a link where I can learn how to set it up?

pospielov commented 2 years ago

Hi, here is our hardware: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60 GHz, 4 cores, 12 GB RAM, no GPU.

The question here is who initiates the OOM error: Linux or Java. If Linux, you can try to reduce this value even more. If Java, then you need to increase this number AND increase the RAM on your machine. Also, do you use Linux or Windows? Docker under Windows requires more RAM than under Linux, and this could be a problem.
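To see who the initiator is, you can check something like this (a rough sketch; compreface-api is the default container name from docker-compose):

```
# The Linux OOM killer leaves a trace in the kernel log:
dmesg | grep -iE "out of memory|oom-kill"

# A Java-side OOM shows up as OutOfMemoryError in the api container log:
docker logs compreface-api 2>&1 | grep -i "OutOfMemoryError"
```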

AnkurChatter commented 2 years ago

I ran the application on a Mac machine as well, and this is the error I am getting. I have already uploaded 1k images, and now I am trying to run the recognition service, which kills the API container.

[Screenshot: error output, 2021-10-05, 7:57 PM]

pospielov commented 2 years ago

Looks like the OS kills it. Try reducing the compreface_api_java_options value to -Xmx2g.

AnkurChatter commented 2 years ago

Sure, checking now. Also, I am running the already downloaded Docker images with the new environment settings; I hope that is okay.

Based on my understanding, it should not be an issue but please let me know if I need to delete the images too.

AnkurChatter commented 2 years ago

It got killed again with the same error. This is my Docker configuration:

[Screenshot: Docker resource configuration, 2021-10-05, 8:11 PM]

pospielov commented 2 years ago

Hah, is this configuration for all containers? So 3 GB is the sum of all RAM consumption? That is probably not enough. Could you try 6 GB with the same configuration? compreface_api_java_options=-Xmx2g and compreface_admin_java_options=-Xmx1g. And yes, you don't need to re-download the images; this is a Java runtime setting.
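As a rough sketch of the arithmetic (the numbers are approximate, not measurements):

```
# Approximate memory budget with the suggested settings:
#   compreface-api     ~2 GB   (-Xmx2g heap + JVM overhead)
#   compreface-admin   ~1 GB   (-Xmx1g heap + JVM overhead)
#   compreface-core    ~1-2 GB (Python/TensorFlow)
#   postgres-db, ui    <1 GB
# So 3 GB for all of Docker is too tight, while 6 GB leaves headroom.
docker stats --no-stream   # shows actual per-container usage
```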

AnkurChatter commented 2 years ago

When you say all containers, what do you mean? If you mean Kubernetes, I am still exploring that, so I will get back to you. But if you mean that all CompreFace containers (admin, api, core, etc.) can use a maximum of 3 GB, then yes.

I have updated the overall RAM to 6 GB; I am now uploading images and checking, and will keep you posted with the results.

AnkurChatter commented 2 years ago

I was able to successfully upload 15k images after increasing the overall RAM to 6 GB, keeping the configuration as compreface_api_java_options=-Xmx2g and compreface_admin_java_options=-Xmx1g.

This is my current docker stats output:

[Screenshot: docker stats output, 2021-10-07, 9:23 AM]

I will also try to add Kubernetes into the mix and upload another 10k images to check whether the container fails. But it looks like decreasing the Java runtime heap and increasing the RAM worked well.

pospielov commented 2 years ago

Yes, this is what I was talking about: the sum of RAM consumption is more than 3 GB, which leads to OOM. When you increased the limit, it started to work.

AnkurChatter commented 2 years ago

Yes. Also, can you please tell me what the maximum number of connections is? I get this error in the API container: Connection Error: sorry, too many clients already

Once I get this, I am not able to proceed further without killing all the containers and deleting the entire volume. The recognition service completely fails after this error. In a production environment this would be a big issue, so I just wanted to know what is causing it and how we can mitigate it.
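For reference, the message looks like the standard PostgreSQL error when max_connections is exhausted; the current limit can be checked roughly like this (assuming the default compreface-postgres-db container and postgres user):

```
docker exec -it compreface-postgres-db \
  psql -U postgres -c "SHOW max_connections;"
```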

pospielov commented 2 years ago

Hi, this is a bug. I already fixed it; I just need to publish a hotfix. I'll try to do it today.

AnkurChatter commented 2 years ago

Sure, once you do, I will upload lots of images, test it, and keep you posted here. Thank you.

pospielov commented 2 years ago

Hi, I uploaded the new 0.6.1 version to Docker Hub. Could you check if it fixes your problem?

AnkurChatter commented 2 years ago

Hi, I downloaded the latest version, 0.6.1. After downloading, I ran the recognition service in registration mode (registering a new face): http://localhost:8000/api/v1/recognition/faces/?subject=Akash. This worked and gave me a correct result. Here is the API container log for the above:

[Screenshot: API container log, 2021-10-09, 12:53 PM]

However, the service is a little slower than the previous version (0.6.0) in terms of how long a new registration takes (previously it was less than 1 second; now it takes 3-4 seconds).

After the registration, I ran the service in recognition mode, which failed. This is the URL: http://localhost:8000/api/v1/recognition/recognize

It was timing out most of the time. I tried checking the container logs, but there was nothing in the API container, and the core container said the request was a success. The CPU usage of the API container went above 200% and its RAM to 2.5 GB for a single request, which does not look right. The core container stats were completely fine. Please find the browser error below.

[Screenshot: browser error, 2021-10-09, 12:55 PM]

I also tried the service that returns the total number of subjects - http://localhost:8000/api/v1/recognition/subjects/ - which worked.

I am not sure why some services pass and some fail, and the logs are not helpful, so I was not able to debug further.

Total RAM is 6 GB and there are 3 CPU cores.
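For reference, this is roughly how I am calling the services (the API key value and the file name are placeholders):

```
# Register a new face for a subject (worked)
curl -X POST "http://localhost:8000/api/v1/recognition/faces/?subject=Akash" \
  -H "x-api-key: <recognition-api-key>" \
  -F "file=@face.jpg"

# Recognize faces on an image (times out)
curl -X POST "http://localhost:8000/api/v1/recognition/recognize" \
  -H "x-api-key: <recognition-api-key>" \
  -F "file=@face.jpg"

# List subjects (worked)
curl -X GET "http://localhost:8000/api/v1/recognition/subjects/" \
  -H "x-api-key: <recognition-api-key>"
```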

AnkurChatter commented 2 years ago

UPDATE: I also tried testing the service through the CompreFace Admin UI. For the demo service, I uploaded the image and waited for the service to return the result; the call was successful and returned the result. For the newly created service, I uploaded the same image and waited for the result; the call failed with a gateway timeout error.

I will create a new service (new API key) and check what happens then.

AnkurChatter commented 2 years ago

UPDATE: So I created a new service, and now it is working. I am uploading images and will keep you posted on the progress.

Also, I do get this error sometimes:

{ "message" : "Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces] [FacesFeignClient#findFaces(MultipartFile,Integer,Double,String)]: [{\"message\":\"ValueError: in user code:\n\n /usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:1569 predict_function \n return step_function(self, iterator)\n ... (2771 bytes)]", "code" : 41 }

pospielov commented 2 years ago

Hi

  1. It's quite strange that it didn't work with the old API key; the code changes shouldn't affect it at all, and I checked the old collection on my local machine - it worked.
  2. I do not see any speed degradation on my machine. Probably that was only the first request, which takes a long time because of library initialization.
  3. What is the full error? Is it "ValueError: Your Layer or Model is in an invalid state"? If yes, I also encountered this error; it looks like a bug in the new TensorFlow version, and I'm investigating what we can do about it.

AnkurChatter commented 2 years ago

Hey, yeah, it worked with the new API key. I was able to push 20k images with 6 GB RAM, 3 CPU cores, and 2 GB of swap memory.

Also, the error I mentioned above keeps happening every few API calls; I was not able to check the container logs.

pospielov commented 2 years ago

I'm looking at what we can do with this error. The problem is that this error is very rare on my laptop. A week ago I was able to reproduce the error in almost every request, but then I rebooted the laptop and now it is reproducible like once in 1000 requests.

pospielov commented 2 years ago

Hi @AnkurChatter, I pushed new Docker images to Docker Hub; I think they should fix the errors above. To check, please run docker-compose pull (it will update your 0.6.1 images) and then docker-compose up -d as usual.
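In other words, from the folder with your docker-compose.yml:

```
docker-compose pull    # updates the 0.6.1 images from Docker Hub
docker-compose up -d   # recreates the containers with the updated images
```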

AnkurChatter commented 2 years ago

Hi @pospielov, thank you. I will pull the latest images and check them out.