exadel-inc / CompreFace

Leading free and open-source face recognition system
https://exadel.com/accelerator-showcase/compreface/
Apache License 2.0
5.52k stars 754 forks

Out of memory error on SubCenter-ArcFace-r100-gpu (Ubuntu 22.04, Nvidia GTX 1060 3GB) #844

Open martinenkoEduard opened 2 years ago

martinenkoEduard commented 2 years ago

It works for a while (and I must say it is blazingly FAST), but after ~50 images it starts to drop images with this error:

face-api | compreface-ui | 172.20.0.1 - - [22/Jul/2022:21:38:25 +0000] "POST /api/v1/detection/detect?&face_plugins=calculator HTTP/1.1" 500 467 "-" "python-requests/2.25.1" compreface-core | {"severity": "CRITICAL", "message": "MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n", "request": {"method": "POST", "path": "/find_faces", "filename": "image.jpg", "api_key": "", "remoteaddr": "172.20.0.4"}, "logger": "src.services.flask.error_handling", "module": "error_handling", "traceback": "Traceback (most recent call last):\n File \"/usr/local/lib/python3.7/dist-packages/flask/app.py\", line 1950, in full_dispatch_request\n rv = self.dispatch_request()\n File \"/usr/local/lib/python3.7/dist-packages/flask/app.py\", line 1936, in dispatch_request\n return self.view_functionsrule.endpoint\n File \"./src/services/flask_/needs_attached_file.py\", line 32, in wrapper\n return f(args, **kwargs)\n File \"./src/_endpoints.py\", line 72, 
in find_faces_post\n face_plugins=face_plugins\n File \"./src/services/facescan/plugins/mixins.py\", line 44, in call\n faces = self._fetch_faces(img, det_prob_threshold)\n File \"./src/services/facescan/plugins/mixins.py\", line 51, in _fetch_faces\n boxes = self.find_faces(img, det_prob_threshold)\n File \"./src/services/facescan/plugins/insightface/insightface.py\", line 83, in find_faces\n results = self._detection_model.get(img, det_thresh=det_prob_threshold)\n File \"/usr/local/lib/python3.7/dist-packages/insightface/app/face_analysis.py\", line 39, in get\n bboxes, landmarks = self.det_model.detect(img, threshold=det_thresh, scale = det_scale)\n File \"/usr/local/lib/python3.7/dist-packages/insightface/model_zoo/face_detection.py\", line 303, in detect\n scores = net_out[idx].asnumpy()\n File \"/usr/local/lib/python3.7/dist-packages/mxnet/ndarray/ndarray.py\", line 1996, in asnumpy\n ctypes.c_size_t(data.size)))\n File \"/usr/local/lib/python3.7/dist-packages/mxnet/base.py\", line 253, in check_call\n raise MXNetError(py_str(_LIB.MXGetLastError()))\nmxnet.base.MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 
0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n\n", "build_version": "dev"} compreface-api | 2022-07-22 21:38:25.481 ERROR 7 --- [nio-8080-exec-4] c.e.f.c.h.ResponseExceptionHandler : Defined exception occurred compreface-api | compreface-api | com.exadel.frs.commonservice.sdk.faces.exception.FacesServiceException: Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces] [FacesFeignClient#findFaces(MultipartFile,Integer,Double,String)]: [{"message":"MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: Map... 
(1133 bytes)] compreface-api | at com.exadel.frs.commonservice.sdk.faces.service.FacesRestApiClient.findFaces(FacesRestApiClient.java:34) compreface-api | at com.exadel.frs.commonservice.sdk.faces.service.FacesRestApiClient$$FastClassBySpringCGLIB$$517e8caf.invoke() compreface-api | at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) compreface-api | at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:687) compreface-api | at com.exadel.frs.commonservice.sdk.faces.service.FacesRestApiClient$$EnhancerBySpringCGLIB$$5f1e9a2e.findFaces() compreface-api | at com.exadel.frs.core.trainservice.service.FaceDetectionProcessServiceImpl.processImage(FaceDetectionProcessServiceImpl.java:31) compreface-api | at com.exadel.frs.core.trainservice.service.FaceDetectionProcessServiceImpl.processImage(FaceDetectionProcessServiceImpl.java:13) compreface-api | at com.exadel.frs.core.trainservice.controller.DetectionController.detect(DetectionController.java:71) compreface-api | at com.exadel.frs.core.trainservice.controller.DetectionController$$FastClassBySpringCGLIB$$6a25be2c.invoke() compreface-api | at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) compreface-api | at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771) compreface-api | at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) compreface-api | at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749) compreface-api | at org.springframework.validation.beanvalidation.MethodValidationInterceptor.invoke(MethodValidationInterceptor.java:119) compreface-api | at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) compreface-api | at 
org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749) compreface-api | at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691) compreface-api | at com.exadel.frs.core.trainservice.controller.DetectionController$$EnhancerBySpringCGLIB$$b1c0ae9e.detect() compreface-api | at jdk.internal.reflect.GeneratedMethodAccessor129.invoke(Unknown Source) compreface-api | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) compreface-api | at java.base/java.lang.reflect.Method.invoke(Unknown Source) compreface-api | at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:190) compreface-api | at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138) compreface-api | at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:105) compreface-api | at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:878) compreface-api | at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:792) compreface-api | at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) compreface-api | at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1040) compreface-api | at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943) compreface-api | at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006) compreface-api | at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:909) compreface-api | at 
javax.servlet.http.HttpServlet.service(HttpServlet.java:652) compreface-api | at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883) compreface-api | at javax.servlet.http.HttpServlet.service(HttpServlet.java:733) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at com.exadel.frs.core.trainservice.filter.SecurityValidationFilter.doFilter(SecurityValidationFilter.java:124) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100) compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93) compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:93) compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201) compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) compreface-api | at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202) compreface-api | at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97) compreface-api | at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541) compreface-api | at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:143) compreface-api | at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) compreface-api | at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78) compreface-api | at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) compreface-api | at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:374) compreface-api | at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) compreface-api | at 
org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868) compreface-api | at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1590) compreface-api | at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) compreface-api | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) compreface-api | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) compreface-api | at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) compreface-api | at java.base/java.lang.Thread.run(Unknown Source) compreface-api | compreface-ui | 172.20.0.1 - - [22/Jul/2022:21:38:25 +0000] "POST /api/v1/detection/detect?&face_plugins=calculator HTTP/1.1" 500 467 "-" "python-requests/2.25.1" compreface-core | {"severity": "CRITICAL", "message": "MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 
0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n", "request": {"method": "POST", "path": "/find_faces", "filename": "image.jpg", "api_key": "", "remoteaddr": "172.20.0.4"}, "logger": "src.services.flask.error_handling", "module": "error_handling", "traceback": "Traceback (most recent call last):\n File \"/usr/local/lib/python3.7/dist-packages/flask/app.py\", line 1950, in full_dispatch_request\n rv = self.dispatch_request()\n File \"/usr/local/lib/python3.7/dist-packages/flask/app.py\", line 1936, in dispatch_request\n return self.view_functionsrule.endpoint\n File \"./src/services/flask_/needs_attached_file.py\", line 32, in wrapper\n return f(args, **kwargs)\n File \"./src/_endpoints.py\", line 72, in find_faces_post\n face_plugins=face_plugins\n File \"./src/services/facescan/plugins/mixins.py\", line 44, in call\n faces = self._fetch_faces(img, det_prob_threshold)\n File \"./src/services/facescan/plugins/mixins.py\", line 51, in _fetch_faces\n boxes = self.find_faces(img, det_prob_threshold)\n File \"./src/services/facescan/plugins/insightface/insightface.py\", line 83, in find_faces\n 
results = self._detection_model.get(img, det_thresh=det_prob_threshold)\n File \"/usr/local/lib/python3.7/dist-packages/insightface/app/face_analysis.py\", line 39, in get\n bboxes, landmarks = self.det_model.detect(img, threshold=det_thresh, scale = det_scale)\n File \"/usr/local/lib/python3.7/dist-packages/insightface/model_zoo/face_detection.py\", line 303, in detect\n scores = net_out[idx].asnumpy()\n File \"/usr/local/lib/python3.7/dist-packages/mxnet/ndarray/ndarray.py\", line 1996, in asnumpy\n ctypes.c_size_t(data.size)))\n File \"/usr/local/lib/python3.7/dist-packages/mxnet/base.py\", line 253, in check_call\n raise MXNetError(py_str(_LIB.MXGetLastError()))\nmxnet.base.MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n\n", "build_version": "dev"}

martinenkoEduard commented 2 years ago

I checked with watch -n0.1 nvidia-smi: it runs out of video memory. It seems the memory is never freed, because the video memory in use only increases.
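To make this kind of observation easier to share than a live watch session, the readings can be sampled from a script. This is only a minimal sketch: it assumes nvidia-smi is on the PATH, and the helper names (parse_memory_line, read_gpu_memory, only_increases) are made up for illustration and are not part of CompreFace.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into a tuple of ints."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def read_gpu_memory():
    """Return [(used_mib, total_mib), ...], one entry per GPU.

    Requires an NVIDIA driver with nvidia-smi available on PATH.
    """
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_memory_line(l) for l in result.stdout.splitlines() if l.strip()]


def only_increases(samples):
    """True if usage never drops between samples and ends higher than it started."""
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]
```

Calling read_gpu_memory() once per processed image and passing the collected "used" values to only_increases() would show whether usage really never drops, or whether it plateaus once each worker has loaded its model.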

martinenkoEduard commented 2 years ago

It is much more likely to happen if I check an image with several faces on it.

pospielov commented 2 years ago

In one of the threads, you asked about adding processes in Python. Each process loads the neural network onto the GPU and doesn't release the memory. Releasing it wouldn't make sense, as loading the NN onto the GPU takes too much time. The problem shouldn't reproduce with one process. So basically, the number of processes you can run is limited by GPU memory.
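As a rough illustration of that limit, the process count can be estimated from the card's memory and the footprint of one worker. This is a back-of-the-envelope sketch, not a CompreFace feature: max_processes is a hypothetical helper, and the per-process figure below is a placeholder that would have to be measured (for example with nvidia-smi) on the actual setup.

```python
def max_processes(total_mib, per_process_mib, reserve_mib=0):
    """Largest number of worker processes whose model copies fit in GPU memory.

    total_mib:       memory on the card, in MiB
    per_process_mib: memory one worker holds after loading the models (measured)
    reserve_mib:     headroom for per-request buffers, the desktop, etc.
    """
    if per_process_mib <= 0:
        raise ValueError("per_process_mib must be positive")
    usable = total_mib - reserve_mib
    return max(usable // per_process_mib, 0)


# A 3 GB card has 3072 MiB. The 1400 MiB per-process figure is hypothetical.
print(max_processes(3072, 1400))                   # two copies fit with no headroom
print(max_processes(3072, 1400, reserve_mib=300))  # only one fits once headroom is reserved
```

The reserve matters because inference on larger images, or on images with several faces, needs temporary buffers on top of the loaded weights, so a configuration that just fits at startup can still run out of memory later.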

allen20200111 commented 1 year ago

I have the same problem. With two processes and one thread configured, the GPU memory sometimes only increases.

> In one of the threads, you asked about adding processes in Python. Each process loads the neural network onto the GPU and doesn't release the memory. Releasing it wouldn't make sense, as loading the NN onto the GPU takes too much time. The problem shouldn't reproduce with one process. So basically, the number of processes you can run is limited by GPU memory.

pospielov commented 1 year ago

I created a bug to investigate. I'm not sure we will be able to fix it, as we use the InsightFace library as is, without changes under the hood.