Gateway timeout on 200k images

nagem07 commented 2 years ago

Describe the bug After adding 200,000 images through the Python SDK, the recognition service stops working and returns a 504 Gateway Timeout error.

To Reproduce Steps to reproduce the behavior:

1. Add 200k images
2. Go to the UI and try the recognition service

Expected behavior The person is recognised.

Desktop:

Ubuntu 20.04
Chrome
CompreFace 1.0.0
RTX2060
64GB RAM

Additional context Logs (Using external host as files are too big)
https://www.mediafire.com/file/h2gap9jn3cja01i/DB.log/file
https://www.mediafire.com/file/sl76j63by5f0xcl/Admin.log/file
https://www.mediafire.com/file/d4jumgd0b609ssl/FE.log/file
https://www.mediafire.com/file/ibvtf15kdvaq10l/API.log/file
https://www.mediafire.com/file/41z5mttlld1us0n/Core.log/file

pospielov commented 2 years ago

Hi, sorry for the late response; I was on a long vacation. According to your API.log file, Java fails because it runs out of memory. Did you update the .env file? The default configuration contains this option:

compreface_api_java_options=-Xmx4g

You should increase it, e.g.:

compreface_api_java_options=-Xmx16g

nagem07 commented 2 years ago

Hey,

Yeah, I did update the .env file and it fixed the issue. However, even with it set to 16 GB, memory usage on 180k images shoots up to 28 GB. Correct me if I am wrong, but if I were to use a million images, would we be looking at 256-384 GB of RAM?
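As a rough sanity check on that estimate, here is a minimal sketch that extrapolates from the single data point above (about 28 GB at 180k images). It assumes memory grows linearly with collection size, which may not hold in practice; the function name is made up for illustration.

```python
def extrapolate_ram_gb(observed_gb: float, observed_images: int, target_images: int) -> float:
    """Scale observed memory use linearly with collection size (an assumption)."""
    return observed_gb * target_images / observed_images


# Figures taken from this thread: roughly 28 GB at 180k images.
estimate = extrapolate_ram_gb(28, 180_000, 1_000_000)
print(f"{estimate:.0f} GB")  # prints "156 GB"
```

Under the linear assumption a million images would need around 156 GB. If usage grows faster than linearly, the real figure would be higher.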

pospielov commented 2 years ago

Yes, unfortunately CompreFace is not optimized for such huge face collections. Supporting them would require different approaches that could be too heavy for small collections. So we had to choose which collection size to support, and for now we chose small collections. I believe 100k is the maximum comfortable size for CompreFace. Of course, it will work with bigger collections, but as you mentioned, it would cost too many resources.

nagem07 commented 2 years ago

I see you prioritized smaller collections. From your knowledge, what would be the steps to optimize CompreFace for bigger collections? I have also noticed very slow service initialization with 180k images: after a reboot, the system takes at least 30 minutes to load the images back into RAM. Could that be addressed as well?

pospielov commented 2 years ago

To support bigger collections, we need to change the architecture. Currently we store face embeddings in Postgres and then load them into RAM in each compreface-api node to calculate face similarities. Ideally, we need a solution that stores the embeddings and calculates similarities in one place, and that place should also be scalable. I know of such a solution: the vector database Milvus, which does basically exactly what we need. But it is targeted at enterprise cloud deployments, so its minimum requirement is 16G of RAM, and 8 CPUs and 32G of RAM are recommended. This is not what most of our users expect from us.
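To make the cost of the current approach concrete, here is a minimal sketch of an in-memory, brute-force nearest-neighbour search over stored embeddings. It is not CompreFace's actual code; the function name, the toy 3-dimensional vectors, and the subject names are all assumptions for illustration.

```python
import math


def nearest_subject(query, embeddings):
    """Return (subject, distance) for the stored embedding closest to query.

    embeddings is a list of (subject_name, vector) pairs held entirely in RAM,
    so both memory use and search time grow with the collection size.
    """
    best_subject, best_distance = None, math.inf
    for subject, vector in embeddings:
        distance = math.dist(query, vector)  # Euclidean distance
        if distance < best_distance:
            best_subject, best_distance = subject, distance
    return best_subject, best_distance


# Toy vectors; real face embeddings have hundreds of dimensions.
stored = [("alice", [0.0, 0.0, 1.0]), ("bob", [1.0, 0.0, 0.0])]
subject, distance = nearest_subject([0.9, 0.1, 0.0], stored)
print(subject, round(distance, 3))
```

Every node that answers recognition requests has to hold the full list, which is why the memory requirement is repeated per compreface-api node.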

I have also noticed very slow service initialization with 180k images: after a reboot, the system takes at least 30 minutes to load the images back into RAM. Could that be addressed as well?

I didn't expect this. For 50k images it takes like 1 minute to load. I'll create a task to check what it could be.

nagem07 commented 2 years ago

Thank you for your answer, and sorry for the delay in getting back to you. Having had a look at Milvus, I understand it would sit in front of Postgres in the architecture, so embeddings would be saved in Milvus while subject information would stay in Postgres. Am I correct?

Having said that, could you advise on which scripts would require changes in order to integrate Milvus? I am looking at more than a million images, and the current architecture has a high cost in terms of resources.

pospielov commented 2 years ago

I didn't research it deeply. I think yes, it should be similar to your description. You need to replace this class, and probably lots of logic related to it with Milvus logic: https://github.com/exadel-inc/CompreFace/blob/master/java/api/src/main/java/com/exadel/frs/core/trainservice/component/classifiers/EuclideanDistanceClassifier.java
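One way to picture the split discussed above is a small interface that hides where vectors live, with subject information kept in a separate store. This is only a sketch under assumptions: `EmbeddingIndex`, `InMemoryIndex`, `recognize`, and the `subjects` dict are hypothetical names, no Milvus calls are shown, and the dict merely stands in for Postgres.

```python
import math
from typing import Protocol, Sequence


class EmbeddingIndex(Protocol):
    """Hypothetical interface: store vectors and return the nearest ones."""

    def add(self, embedding_id: int, vector: Sequence[float]) -> None: ...

    def search(self, query: Sequence[float], limit: int) -> list[tuple[int, float]]: ...


class InMemoryIndex:
    """Brute-force stand-in; a vector-database-backed class could expose the same methods."""

    def __init__(self) -> None:
        self._vectors: dict[int, Sequence[float]] = {}

    def add(self, embedding_id: int, vector: Sequence[float]) -> None:
        self._vectors[embedding_id] = vector

    def search(self, query: Sequence[float], limit: int) -> list[tuple[int, float]]:
        scored = [(eid, math.dist(query, vec)) for eid, vec in self._vectors.items()]
        return sorted(scored, key=lambda pair: pair[1])[:limit]


# Subject information stays in a relational store; a dict stands in for it here.
subjects = {1: "alice", 2: "bob"}


def recognize(index: EmbeddingIndex, query: Sequence[float], limit: int = 1):
    """Look up nearest embedding ids in the index, then resolve them to subjects."""
    return [(subjects[eid], dist) for eid, dist in index.search(query, limit)]


index = InMemoryIndex()
index.add(1, [0.0, 0.0, 1.0])
index.add(2, [1.0, 0.0, 0.0])
print(recognize(index, [0.9, 0.1, 0.0]))
```

With a boundary like this, the code calling `recognize` would not need to know whether similarities are computed in local RAM or in an external vector store.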

martinenkoEduard commented 2 years ago

What is the maximum practical number of faces?

pospielov commented 2 years ago

It depends on what you mean by practical. I would recommend using no more than 100k-200k. But some people may use CompreFace with more faces, because buying hardware is often cheaper than buying software or paying for custom development.

nagem07 commented 2 years ago

I didn't research it deeply. I think yes, it should be similar to your description. You need to replace this class, and probably lots of logic related to it with Milvus logic: https://github.com/exadel-inc/CompreFace/blob/master/java/api/src/main/java/com/exadel/frs/core/trainservice/component/classifiers/EuclideanDistanceClassifier.java

Would that be the sole class that requires modification or are there others as well?

pospielov commented 2 years ago

Others as well; this is just where you can start exploring.

exadel-inc / CompreFace

Gateway timeout on 200k images #776