martinenkoEduard opened this issue 2 years ago
Checked with watch -n0.1 nvidia-smi: it runs out of video memory, and it seems the memory is never freed, because the video memory in use only increases...
It is much more likely to happen if I check an image with several faces in it.
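The same check as watch -n0.1 nvidia-smi can be scripted so the numbers are recorded and the "only increases" pattern is easy to confirm. This is a minimal Python sketch, assuming nvidia-smi is on the PATH; it relies on the real `--query-gpu=memory.used --format=csv,noheader,nounits` query, while the helper names, the sample count and the 0.5 s interval are illustrative choices, not anything from CompreFace.

```python
import subprocess
import time
from typing import List


def parse_memory_used(csv_output: str) -> List[int]:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`
    output: one integer value in MiB per GPU, one GPU per line."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def only_increases(samples: List[int]) -> bool:
    """True if memory usage never goes down across the samples."""
    return all(later >= earlier for earlier, later in zip(samples, samples[1:]))


def sample_gpu_memory(gpu_index: int = 0, count: int = 20, interval_s: float = 0.5) -> List[int]:
    """Poll nvidia-smi `count` times and return memory.used (MiB) for one GPU."""
    samples = []
    for _ in range(count):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        samples.append(parse_memory_used(out)[gpu_index])
        time.sleep(interval_s)
    return samples
```

Run `history = sample_gpu_memory()` while sending images, then `only_increases(history)`; comparing runs with single-face and multi-face images shows whether the growth depends on the number of faces.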
In one of the threads, you asked about adding processes in Python. Each process loads the neural network into the GPU and doesn't release that memory. Releasing it wouldn't make sense, because loading the network into the GPU takes too much time. It shouldn't reproduce with one process. So basically, the number of processes you can run is limited by GPU memory.
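The limit described in this reply is simple arithmetic: every process keeps its own copy of the network plus its working buffers on the GPU, so the number of processes is roughly the usable memory divided by the peak footprint of one process. A back-of-the-envelope Python sketch; the total, per-process peak and reserved values are hypothetical inputs you would measure yourself (for example with nvidia-smi after a few requests), not figures from CompreFace. The per-process peak may also depend on the input, such as the images with several faces reported above.

```python
def max_processes(total_mib: int, per_process_peak_mib: int, reserved_mib: int = 0) -> int:
    """Upper bound on worker processes that fit on one GPU.

    Memory is not shared between processes, so each one needs its own
    peak footprint. All inputs are hypothetical, measured values in MiB."""
    if per_process_peak_mib <= 0:
        raise ValueError("per_process_peak_mib must be positive")
    usable = total_mib - reserved_mib
    # Integer division: a partially fitting process does not count.
    return max(usable // per_process_peak_mib, 0)
```

For example, an 8192 MiB card with 512 MiB held back and a measured peak of 2500 MiB per process gives `max_processes(8192, 2500, 512) == 3`.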
I have the same problem: with two processes and one thread configured, the GPU memory in use sometimes only increases.
I created a bug to investigate this, but I'm not sure we will be able to fix it, as we use the Insightface library as is, without changes under the hood.
It works for a while (and I must say it is blazingly FAST), but after ~50 images it starts to drop images with this error:
face-api |
compreface-ui | 172.20.0.1 - - [22/Jul/2022:21:38:25 +0000] "POST /api/v1/detection/detect?&face_plugins=calculator HTTP/1.1" 500 467 "-" "python-requests/2.25.1"
compreface-core | {"severity": "CRITICAL", "message": "MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n", "request": {"method": "POST", "path": "/find_faces", "filename": "image.jpg", "api_key": "", "remoteaddr": "172.20.0.4"}, "logger": "src.services.flask.error_handling", "module": "error_handling", "traceback": "Traceback (most recent call last):\n File \"/usr/local/lib/python3.7/dist-packages/flask/app.py\", line 1950, in full_dispatch_request\n rv = self.dispatch_request()\n File \"/usr/local/lib/python3.7/dist-packages/flask/app.py\", line 1936, in dispatch_request\n return self.view_functionsrule.endpoint\n File \"./src/services/flask_/needs_attached_file.py\", line 32, in wrapper\n return f(*args, **kwargs)\n File \"./src/_endpoints.py\", line 72, in find_faces_post\n face_plugins=face_plugins\n File \"./src/services/facescan/plugins/mixins.py\", line 44, in __call__\n faces = self._fetch_faces(img, det_prob_threshold)\n File \"./src/services/facescan/plugins/mixins.py\", line 51, in _fetch_faces\n boxes = self.find_faces(img, det_prob_threshold)\n File \"./src/services/facescan/plugins/insightface/insightface.py\", line 83, in find_faces\n results = self._detection_model.get(img, det_thresh=det_prob_threshold)\n File \"/usr/local/lib/python3.7/dist-packages/insightface/app/face_analysis.py\", line 39, in get\n bboxes, landmarks = self.det_model.detect(img, threshold=det_thresh, scale = det_scale)\n File \"/usr/local/lib/python3.7/dist-packages/insightface/model_zoo/face_detection.py\", line 303, in detect\n scores = net_out[idx].asnumpy()\n File \"/usr/local/lib/python3.7/dist-packages/mxnet/ndarray/ndarray.py\", line 1996, in asnumpy\n ctypes.c_size_t(data.size)))\n File \"/usr/local/lib/python3.7/dist-packages/mxnet/base.py\", line 253, in check_call\n raise MXNetError(py_str(_LIB.MXGetLastError()))\nmxnet.base.MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory\nStack trace:\n [bt] (0) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f76f6bbf4cb]\n [bt] (1) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x2f59431) [0x7f76f9668431]\n [bt] (2) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b61ee) [0x7f76f98c51ee]\n [bt] (3) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x31b9a16) [0x7f76f98c8a16]\n [bt] (4) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25db7a9) [0x7f76f8cea7a9]\n [bt] (5) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25e1a1a) [0x7f76f8cf0a1a]\n [bt] (6) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f76f8cd0cd1]\n [bt] (7) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c51e0) [0x7f76f8cd41e0]\n [bt] (8) /usr/local/lib/python3.7/dist-packages/mxnet/libmxnet.so(+0x25c5476) [0x7f76f8cd4476]\n\n\n", "build_version": "dev"}
compreface-api | 2022-07-22 21:38:25.481 ERROR 7 --- [nio-8080-exec-4] c.e.f.c.h.ResponseExceptionHandler : Defined exception occurred
compreface-api |
compreface-api | com.exadel.frs.commonservice.sdk.faces.exception.FacesServiceException: Error during synchronization between servers: [500 INTERNAL SERVER ERROR] during [POST] to [http://compreface-core:3000/find_faces] [FacesFeignClient#findFaces(MultipartFile,Integer,Double,String)]: [{"message":"MXNetError: [21:38:25] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: Map... (1133 bytes)]
compreface-api | at com.exadel.frs.commonservice.sdk.faces.service.FacesRestApiClient.findFaces(FacesRestApiClient.java:34)
compreface-api | at com.exadel.frs.commonservice.sdk.faces.service.FacesRestApiClient$$FastClassBySpringCGLIB$$517e8caf.invoke()
compreface-api | at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
compreface-api | at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:687)
compreface-api | at com.exadel.frs.commonservice.sdk.faces.service.FacesRestApiClient$$EnhancerBySpringCGLIB$$5f1e9a2e.findFaces()
compreface-api | at com.exadel.frs.core.trainservice.service.FaceDetectionProcessServiceImpl.processImage(FaceDetectionProcessServiceImpl.java:31)
compreface-api | at com.exadel.frs.core.trainservice.service.FaceDetectionProcessServiceImpl.processImage(FaceDetectionProcessServiceImpl.java:13)
compreface-api | at com.exadel.frs.core.trainservice.controller.DetectionController.detect(DetectionController.java:71)
compreface-api | at com.exadel.frs.core.trainservice.controller.DetectionController$$FastClassBySpringCGLIB$$6a25be2c.invoke()
compreface-api | at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
compreface-api | at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771)
compreface-api | at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
compreface-api | at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
compreface-api | at org.springframework.validation.beanvalidation.MethodValidationInterceptor.invoke(MethodValidationInterceptor.java:119)
compreface-api | at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
compreface-api | at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
compreface-api | at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691)
compreface-api | at com.exadel.frs.core.trainservice.controller.DetectionController$$EnhancerBySpringCGLIB$$b1c0ae9e.detect()
compreface-api | at jdk.internal.reflect.GeneratedMethodAccessor129.invoke(Unknown Source)
compreface-api | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
compreface-api | at java.base/java.lang.reflect.Method.invoke(Unknown Source)
compreface-api | at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:190)
compreface-api | at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138)
compreface-api | at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:105)
compreface-api | at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:878)
compreface-api | at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:792)
compreface-api | at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
compreface-api | at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1040)
compreface-api | at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943)
compreface-api | at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
compreface-api | at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:909)
compreface-api | at javax.servlet.http.HttpServlet.service(HttpServlet.java:652)
compreface-api | at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
compreface-api | at javax.servlet.http.HttpServlet.service(HttpServlet.java:733)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
compreface-api | at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
compreface-api | at com.exadel.frs.core.trainservice.filter.SecurityValidationFilter.doFilter(SecurityValidationFilter.java:124)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
compreface-api | at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
compreface-api | at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
compreface-api | at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:93)
compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
compreface-api | at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
compreface-api | at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
compreface-api | at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
compreface-api | at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202)
compreface-api | at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
compreface-api | at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541)
compreface-api | at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:143)
compreface-api | at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
compreface-api | at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
compreface-api | at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
compreface-api | at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:374)
compreface-api | at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
compreface-api | at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868)
compreface-api | at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1590)
compreface-api | at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
compreface-api | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
compreface-api | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
compreface-api | at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
compreface-api | at java.base/java.lang.Thread.run(Unknown Source)
compreface-api |
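The key part of this log is `Check failed: err == cudaSuccess (2 vs. 0) ... ErrStr:out of memory`: the CUDA call returned error code 2, which is `cudaErrorMemoryAllocation` in the CUDA runtime API, so the 500 comes from the GPU running out of memory rather than from a bad request. To separate these failures from other 500s when scanning logs or error messages, a small Python heuristic is enough. The function name and the regex are illustrative, written only against the message format in this log.

```python
import re

# Matches MXNet's CUDA check failure, e.g.
# "Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory"
_CUDA_CHECK = re.compile(r"err == cudaSuccess \((\d+) vs\. 0\)")

CUDA_ERROR_MEMORY_ALLOCATION = 2  # cudaErrorMemoryAllocation in the CUDA runtime API


def is_gpu_oom(message: str) -> bool:
    """Heuristic: True if an error message looks like a CUDA out-of-memory failure."""
    match = _CUDA_CHECK.search(message)
    if match and int(match.group(1)) == CUDA_ERROR_MEMORY_ALLOCATION:
        return True
    # Fall back to the explicit error string MXNet appends.
    return "ErrStr:out of memory" in message
```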
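To pin down when the failures start (~50 images in this report), a small client loop can send the same kind of request seen in the log (`POST /api/v1/detection/detect?face_plugins=calculator` via python-requests) and record the status codes. This is a sketch under assumptions: the `requests` package, a hypothetical base URL `http://localhost:8000`, and the detection service API key in an `x-api-key` header with the image in a multipart `file` field; check these against your CompreFace version.

```python
from pathlib import Path
from typing import Iterable, List, Optional

# Hypothetical values: adjust to your deployment.
BASE_URL = "http://localhost:8000"
API_KEY = "your-detection-service-api-key"


def first_failure(status_codes: Iterable[int]) -> Optional[int]:
    """Return the 1-based index of the first 5xx response, or None."""
    for i, code in enumerate(status_codes, start=1):
        if code >= 500:
            return i
    return None


def send_images(paths: Iterable[Path]) -> List[int]:
    """POST each image to the detection endpoint and collect status codes."""
    import requests  # imported here so first_failure() works without requests installed

    codes = []
    with requests.Session() as session:
        for path in paths:
            with open(path, "rb") as fh:
                resp = session.post(
                    f"{BASE_URL}/api/v1/detection/detect",
                    params={"face_plugins": "calculator"},
                    headers={"x-api-key": API_KEY},
                    files={"file": (path.name, fh, "image/jpeg")},
                    timeout=60,
                )
            codes.append(resp.status_code)
    return codes
```

Usage: `codes = send_images(sorted(Path("images").glob("*.jpg")))`, then `first_failure(codes)`. Running it once with single-face images and once with group photos, while watching nvidia-smi, should show whether the failure point moves with the number of faces.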