Open nagem07 opened 2 years ago
Hi, sorry for the late response, I was on a long vacation.
According to your API.log file, Java fails because it runs out of memory. Did you update the .env file? The default configuration contains this option:
compreface_api_java_options=-Xmx4g
You should update it, e.g.:
compreface_api_java_options=-Xmx16g
Hey,
Yeah, I did update the .env file and it fixed the issue. However, even with it set to 16 GB, memory usage shoots up to 28 GB with 180k images. Correct me if I am wrong, but if I were to use a million images, would we be looking at 256-384 GB of RAM?
Yes, unfortunately, CompreFace is not optimized for such huge face collections. Supporting them would require different approaches that could be too heavy for small collections. So we had to choose which collection size to support, and for now we chose small collections. I believe 100k is the maximum comfortable size for CompreFace. Of course, it will work with bigger collections, but as you mentioned, it would cost too many resources.
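As a rough check on the numbers discussed above, the single data point reported in this thread (28 GB of RAM at 180k images) can be extrapolated. This is only a sketch: it assumes memory grows linearly with the number of images, which the thread does not confirm, and the function name `estimate_ram_gb` is made up for illustration.

```python
# Rough linear extrapolation of RAM usage from one observed data point.
# Assumption: memory grows proportionally to the number of images.
# JVM overhead and other factors may make real usage differ.

OBSERVED_IMAGES = 180_000   # collection size reported in the thread
OBSERVED_RAM_GB = 28.0      # peak RAM reported in the thread


def estimate_ram_gb(target_images: int) -> float:
    """Scale the observed RAM usage linearly to a target collection size."""
    gb_per_image = OBSERVED_RAM_GB / OBSERVED_IMAGES
    return gb_per_image * target_images


if __name__ == "__main__":
    for n in (100_000, 180_000, 1_000_000):
        print(f"{n:>9,} images -> ~{estimate_ram_gb(n):.0f} GB")
```

Under that linear assumption, a million images would need roughly 156 GB. Treat this as an order-of-magnitude figure rather than a sizing guarantee; the value set in `compreface_api_java_options` would need to be raised accordingly.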
I see you prioritized smaller collections. From your knowledge, what would be the steps to optimize CompreFace for bigger collections? I have also noticed very slow service initialization with 180k images: after a reboot, the system takes at least 30 minutes to load the images back into RAM. Could that be addressed as well?
To support bigger collections, we need to change the architecture.
Currently we store face embeddings in Postgres and then load them into RAM on each compreface-api node to calculate face similarities.
Ideally, we need a solution that stores the embeddings and calculates similarities in one place. Furthermore, that place should also be scalable. I know of such a solution: the vector database Milvus, which does basically exactly what we need. But it is targeted at enterprise cloud deployments, so its minimum requirement is 16 GB of RAM, and 8 CPUs with 32 GB of RAM are recommended. This is not what most of our users expect from us.
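To illustrate why the in-memory design described above scales poorly, here is a minimal sketch of brute-force matching. It is not CompreFace's actual code: the `EMBEDDINGS` data and the `nearest_subject` function are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math

# Hypothetical in-memory store: subject name -> embedding vector.
# Real embeddings are much longer; 3 dimensions keep the example readable.
EMBEDDINGS = {
    "alice": [0.1, 0.9, 0.2],
    "bob": [0.8, 0.1, 0.3],
    "carol": [0.4, 0.4, 0.9],
}


def nearest_subject(query, embeddings):
    """Return (subject, distance) of the closest stored embedding.

    Every stored vector is compared with the query, so both the time per
    request and the RAM needed grow linearly with the collection size.
    """
    best_subject, best_distance = None, math.inf
    for subject, vector in embeddings.items():
        distance = math.dist(query, vector)  # Euclidean distance
        if distance < best_distance:
            best_subject, best_distance = subject, distance
    return best_subject, best_distance


if __name__ == "__main__":
    subject, distance = nearest_subject([0.75, 0.15, 0.25], EMBEDDINGS)
    print(subject, round(distance, 3))
```

A vector database avoids this full scan by building an index over the vectors, which is the kind of work that would be delegated to something like Milvus.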
> I have also noticed very slow service initialization with 180k images: after a reboot, the system takes at least 30 minutes to load the images back into RAM. Could that be addressed as well?
I didn't expect this. For 50k images it takes about 1 minute to load. I'll create a task to investigate the cause.
Thank you for your answer, and sorry for the delay in getting back to you. Having had a look at Milvus, I understand it would sit in front of Postgres in the architecture: embeddings would be saved in Milvus, while subject information would stay in Postgres. Am I correct?
Having said that, could you advise which classes would require changes in order to integrate Milvus? I am looking at over a million images, and the current architecture has a high cost in terms of resources.
I haven't researched it deeply, but yes, I think it should be close to what you describe. You would need to replace this class, and probably a lot of the logic related to it, with Milvus logic: https://github.com/exadel-inc/CompreFace/blob/master/java/api/src/main/java/com/exadel/frs/core/trainservice/component/classifiers/EuclideanDistanceClassifier.java
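The split discussed here (embeddings in a vector store, subject information in a relational database) could look roughly like the sketch below. Everything in it is hypothetical: `VectorStore`, `InMemoryVectorStore`, `SUBJECTS` and `recognize` are illustration names, not CompreFace or Milvus APIs, and the in-memory class only stands in for whatever a Milvus-backed implementation would do.

```python
from __future__ import annotations

import math
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical seam: anything that can store and search embeddings."""

    def add(self, embedding_id: int, vector: list[float]) -> None: ...

    def search(self, query: list[float], limit: int) -> list[tuple[int, float]]: ...


class InMemoryVectorStore:
    """Brute-force stand-in; a Milvus-backed class would replace this one."""

    def __init__(self) -> None:
        self._vectors: dict[int, list[float]] = {}

    def add(self, embedding_id: int, vector: list[float]) -> None:
        self._vectors[embedding_id] = vector

    def search(self, query: list[float], limit: int) -> list[tuple[int, float]]:
        scored = [(eid, math.dist(query, vec)) for eid, vec in self._vectors.items()]
        return sorted(scored, key=lambda pair: pair[1])[:limit]


# Stand-in for the relational side (Postgres): embedding id -> subject name.
SUBJECTS = {1: "alice", 2: "bob"}


def recognize(store: VectorStore, query: list[float]) -> str | None:
    """Find the nearest embedding, then resolve its subject by id."""
    hits = store.search(query, limit=1)
    if not hits:
        return None
    embedding_id, _distance = hits[0]
    return SUBJECTS.get(embedding_id)


if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add(1, [0.1, 0.9])
    store.add(2, [0.8, 0.1])
    print(recognize(store, [0.7, 0.2]))
```

Keeping the search behind a small interface like this means the rest of the recognition flow only deals with ids, so the storage backend could be swapped without touching the subject lookup. As the replies in this thread note, the real change would touch more than one class.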
What is the maximum practical number of faces?
It depends on what you mean by practical. I would recommend using no more than 100k-200k. But some people may use CompreFace with more faces, because buying hardware is often cheaper than buying software or paying for custom development.
> I haven't researched it deeply, but yes, I think it should be close to what you describe. You would need to replace this class, and probably a lot of the logic related to it, with Milvus logic: https://github.com/exadel-inc/CompreFace/blob/master/java/api/src/main/java/com/exadel/frs/core/trainservice/component/classifiers/EuclideanDistanceClassifier.java
Would that be the sole class that requires modification or are there others as well?
Others as well; this is just a good place to start exploring.
Describe the bug: After adding 200,000 images through the Python SDK, the recognition service stops working and returns a 504 Gateway Timeout error.
Expected behavior: The person is recognised.
Additional context: Logs (hosted externally because the files are too big):
https://www.mediafire.com/file/h2gap9jn3cja01i/DB.log/file
https://www.mediafire.com/file/sl76j63by5f0xcl/Admin.log/file
https://www.mediafire.com/file/d4jumgd0b609ssl/FE.log/file
https://www.mediafire.com/file/ibvtf15kdvaq10l/API.log/file
https://www.mediafire.com/file/41z5mttlld1us0n/Core.log/file