Open calloatti opened 1 year ago
Another set of logs, same problem, response 500.
This may be a Docker Desktop problem. Docker Desktop seems unresponsive/sluggish after this error happens. I tried installing some extensions and the UI seems stuck on "Installing".
After about 5 minutes stuck, this happened:
But this could be because I had tried to close Docker Desktop 10 minutes earlier and nothing happened.
Unfortunately, I can't say anything about it. It looks like there are some problems with memory allocation, but the reason is unclear to me. Would you happen to have any ideas on how to reproduce it?
I will do some research on how to reproduce it.
I managed to reproduce it. The key seems to be having two instances doing recognition at the same time. I left one instance running for 12 hours with no errors, just doing base64 recognition:
http://127.0.0.1:8000/api/v1/recognition/recognize?limit=1&det_prob_threshold=0.50&prediction_count=5&face_plugins=&status=true
I cancelled the process, then set up two instances doing the same thing at the same time. After a couple of minutes I got the 500 response error.
Trying to restart either one fails every time with a 500 error.
Using SubCenter-ArcFace-r100
Images are 480x640 with an average size of 25 KB. I only send the face rectangle obtained from a previous face detection API call.
Docker Desktop on Windows 10 using integration with WSL2, assigning 16GB to WSL using .wslconfig
This is all on a different PC from the one I was using when I reported this issue.
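The two-instance reproduction described above can be sketched as a small script. This is only a minimal sketch, not the client actually used in the report: the URL is the one quoted above, while the API key value, the `{"file": ...}` JSON body, the `x-api-key` header, and the helper names (`build_request`, `send`, `worker`, `run_concurrently`) are assumptions for illustration.

```python
import base64
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# URL as quoted in the report; the API key below is a placeholder (assumption).
URL = ("http://127.0.0.1:8000/api/v1/recognition/recognize"
       "?limit=1&det_prob_threshold=0.50&prediction_count=5"
       "&face_plugins=&status=true")
API_KEY = "your-recognition-service-api-key"


def build_request(image_bytes, url=URL, api_key=API_KEY):
    """Build a POST request carrying the image as base64 in a JSON body."""
    payload = {"file": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )


def send(request):
    """Send the request and return the HTTP status, including error codes."""
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def worker(image_bytes, count, sender=send):
    """Send up to `count` requests in sequence, stopping at the first non-200."""
    statuses = []
    for _ in range(count):
        status = sender(build_request(image_bytes))
        statuses.append(status)
        if status != 200:
            break
    return statuses


def run_concurrently(image_bytes, workers=2, count=1000, sender=send):
    """Run several workers at once, mimicking two simultaneous instances."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker, image_bytes, count, sender)
                   for _ in range(workers)]
        return [future.result() for future in futures]
```

Calling `run_concurrently(open("face.jpg", "rb").read())` (with a hypothetical `face.jpg`) returns one list of status codes per worker. A list ending in 500 would correspond to the failure described here, although threads in one process may not behave exactly like two separate client processes.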
Hello. This error message indicates a mismatch between the memory size required by the MKL-DNN operation (1,140,480 bytes) and the actual size of the provided NDArray (737,280 bytes). The operation is likely failing because it does not have enough memory to complete its work.
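To make the size mismatch concrete, the byte counts can be pulled out of the error text and compared. This is a rough sketch: the helper names are hypothetical, and the 4 bytes per element figure assumes float32 data, which the logs do not confirm.

```python
import re

# Matches the byte counts in the MXNet message quoted in this issue.
PATTERN = re.compile(r"requests for (\d+) bytes, but got (\d+) bytes")


def parse_mkldnn_mismatch(message):
    """Return (requested_bytes, got_bytes), or None if the text does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def summarize(requested, got, bytes_per_element=4):
    """Compare the two sizes; bytes_per_element=4 assumes float32 (assumption)."""
    return {
        "requested_elements": requested // bytes_per_element,
        "got_elements": got // bytes_per_element,
        "shortfall_bytes": requested - got,
        "ratio": requested / got,
    }


message = ("The size of NDArray doesn't match the requested MKLDNN memory desc. "
           "MKLDNN memory requests for 1140480 bytes, but got 737280 bytes "
           "from NDArray")
requested, got = parse_mkldnn_mismatch(message)
print(summarize(requested, got))
```

Under the float32 assumption, the operation asks for 285,120 elements while the NDArray holds 184,320, a shortfall of 403,200 bytes.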
I've made several attempts to reproduce this bug but have not been successful so far. The application works flawlessly on my Windows 10 platform. To better understand the situation, it would be immensely helpful if you could provide the following details, captured at the time you are getting the errors:
Please also check your images in Docker
Describe the bug
/api/v1/recognition/recognize response is sometimes 500:
{ "message" : "Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces_base64?limit=1&det_prob_threshold=0.5&face_plugins=calculator] [FacesFeignClient#findFacesBase64(FindFacesRequest,Integer,Double,String)]: [{\"message\":\"MXNetError: MXNetError: could not execute a primitive\"}\n]", "code" : 41 }
{ "message" : "Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces_base64?limit=1&det_prob_threshold=0.5&face_plugins=calculator] [FacesFeignClient#findFacesBase64(FindFacesRequest,Integer,Double,String)]: [{\"message\":\"MXNetError: Traceback (most recent call last):\n File \\"../src/ndarray/ndarray.cc\\", line 707\nMXNetError: The size of NDArray doesn't match the requested MKLDNN memory desc. MKLDNN memory requests for 1140480 bytes, but got 737280 bytes from NDArray\"}\n]", "code" : 41 }
Expected behavior
/api/v1/recognition/recognize to respond with 200
Screenshots
None
Desktop (please complete the following information):
Logs
Logs for each service are in separate files inside the attached zip. The compreface-core log covers only the last 12 hours, since the full log is 41 GB.
SubCenter-ArcFace-r100-logs.zip
Additional context
Have done around 40k requests to a recognition service and around 170k requests to a detection service with no issues, then this started to happen.