exadel-inc / CompreFace

Leading free and open-source face recognition system
https://exadel.com/accelerator-showcase/compreface/
Apache License 2.0
5.7k stars, 775 forks

Error during synchronization between servers #1087

Open calloatti opened 1 year ago

calloatti commented 1 year ago

Describe the bug

/api/v1/recognition/recognize response is sometimes 500:

{ "message" : "Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces_base64?limit=1&det_prob_threshold=0.5&face_plugins=calculator] [FacesFeignClient#findFacesBase64(FindFacesRequest,Integer,Double,String)]: [{\"message\":\"MXNetError: MXNetError: could not execute a primitive\"}\n]", "code" : 41 }

{ "message" : "Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces_base64?limit=1&det_prob_threshold=0.5&face_plugins=calculator] [FacesFeignClient#findFacesBase64(FindFacesRequest,Integer,Double,String)]: [{\"message\":\"MXNetError: Traceback (most recent call last):\n File \\"../src/ndarray/ndarray.cc\\", line 707\nMXNetError: The size of NDArray doesn't match the requested MKLDNN memory desc. MKLDNN memory requests for 1140480 bytes, but got 737280 bytes from NDArray\"}\n]", "code" : 41 }

Expected behavior

/api/v1/recognition/recognize to respond with 200

Screenshots

None

Logs

Logs for each service are in separate files inside the attached zip. The compreface-core log covers only the last 12 hours, since the full log is 41 GB.

SubCenter-ArcFace-r100-logs.zip

Additional context

I made around 40k requests to a recognition service and around 170k requests to a detection service with no issues; then this started to happen.

calloatti commented 1 year ago

Another set of logs, same problem, response 500.

logs-20230619135800.zip

calloatti commented 1 year ago

Could this be a Docker Desktop problem? Docker Desktop seems unresponsive/sluggish after this error happens. I tried installing some extensions and the UI seems stuck in "Installing".

After about 5 minutes stuck, this happened:

image

But this could be due to the fact that I tried to close Docker Desktop 10 minutes before and nothing happened?

pospielov commented 1 year ago

Unfortunately, I can't say anything about it. It looks like there are some problems with memory allocation, but the reason is unclear to me. Would you happen to have any ideas on how to reproduce it?

calloatti commented 1 year ago

I will do some research on how to reproduce it.

calloatti commented 1 year ago

I managed to reproduce it. The key seems to be having two instances doing recognition at the same time. I left one instance running for 12 hours with no errors, just doing base64 recognition:

http://127.0.0.1:8000/api/v1/recognition/recognize?limit=1&det_prob_threshold=0.50&prediction_count=5&face_plugins=&status=true

I cancelled the process, then set up two instances doing the same thing at the same time. After a couple of minutes I got the 500 response error.

After that, restarting either one fails every time with a 500 error.
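
The reproduction described above can be approximated with a small script. This is only a sketch under assumptions, not the script actually used in this report: it runs two clients as threads in one process rather than as two separate instances, the API key is a placeholder, and the helper names (`build_request`, `send`, `worker`, `run_clients`) are made up for illustration. It assumes the base64 variant of the endpoint takes a JSON body with a `file` field and an `x-api-key` header.

```python
import base64
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: host, port and query string are taken from the URL quoted
# above; API_KEY is a placeholder for a recognition service key.
BASE_URL = "http://127.0.0.1:8000/api/v1/recognition/recognize"
QUERY = "limit=1&det_prob_threshold=0.50&prediction_count=5&face_plugins=&status=true"
API_KEY = "your-recognition-service-api-key"  # placeholder


def build_request(image_bytes, api_key=API_KEY):
    """Build a base64 recognition request (JSON body with a "file" field)."""
    body = json.dumps({"file": base64.b64encode(image_bytes).decode("ascii")})
    return urllib.request.Request(
        f"{BASE_URL}?{QUERY}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def send(request):
    """Send one request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(request, timeout=60) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 500 for the error reported in this issue


def worker(image_bytes, count, sender=send):
    """Send `count` requests one after another; return their status codes."""
    return [sender(build_request(image_bytes)) for _ in range(count)]


def run_clients(image_bytes, clients=2, count=100, sender=send):
    """Run `clients` workers in parallel and tally the status codes."""
    tally = {}
    with ThreadPoolExecutor(max_workers=clients) as pool:
        futures = [
            pool.submit(worker, image_bytes, count, sender)
            for _ in range(clients)
        ]
        for future in futures:
            for status in future.result():
                tally[status] = tally.get(status, 0) + 1
    return tally
```

Calling `run_clients(open("face.jpg", "rb").read())` with a cropped face image (the file name is hypothetical) returns a dictionary of status codes and counts, so any 500 responses show up as a separate key.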

logs.zip

Using SubCenter-ArcFace-r100

Images are 480x640 with an average size of 25 KB. I only send the face rectangle obtained from a previous face detection API call.

Docker Desktop on Windows 10 using integration with WSL2, assigning 16GB to WSL using .wslconfig

This is all on a different PC from the one I was using when I reported this issue.

Screenshots attached: 500-1, 500-2, image

Anatolii-R commented 1 year ago

Hello. This error message indicates a mismatch between the memory size that the MKL-DNN operation requests (1,140,480 bytes) and the actual size of the NDArray provided to it (737,280 bytes). The operation most likely fails because the NDArray does not supply as much memory as the operation needs.
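
To make the size of the gap concrete, the two numbers can be pulled out of the error text and compared. This is a minimal sketch: the `size_mismatch` helper is hypothetical, and the element counts at the end assume 4-byte float32 values, which the error message itself does not state.

```python
import re

# Error text quoted from the 500 response in this issue.
MESSAGE = (
    "MKLDNN memory requests for 1140480 bytes, "
    "but got 737280 bytes from NDArray"
)

PATTERN = re.compile(r"requests for (\d+) bytes, but got (\d+) bytes")


def size_mismatch(message):
    """Return (requested, provided, shortfall) in bytes, or None if no match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    requested, provided = int(match.group(1)), int(match.group(2))
    return requested, provided, requested - provided


requested, provided, shortfall = size_mismatch(MESSAGE)
print(requested, provided, shortfall)  # 1140480 737280 403200

# Assumption: elements are 4-byte float32 values.
print(requested // 4, provided // 4)  # 285120 184320
```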

I've made several attempts to reproduce this bug but have not been successful so far; the application works without errors on my Windows 10 machine. We need some diagnostic information to better understand the situation. It would be very helpful if you could provide the following details from the time you are getting the errors:

Also, please check your images in Docker.