exadel-inc / CompreFace

Leading free and open-source face recognition system
https://exadel.com/accelerator-showcase/compreface/
Apache License 2.0
5.7k stars 775 forks

Error: Nvidia K80 #1009

Open fabio017 opened 1 year ago

fabio017 commented 1 year ago

Hi! I have the following problem when trying to start CompreFace/SubCenter-ArcFace-r100-gpu:

```
RuntimeError: simple_bind error. Arguments: date: (1, 3, 480, 640)
Traceback (most recent call last):
  File "../src/storage/storage.cc", line 97
CUDA: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: no CUDA-capable device is detected
build_version=dev
```

However, when I run CompreFace/SubCenter-ArcFace-r100 it works normally. On this same host I use deepstack:gpu-2022.01.1 and it works OK; I also use Frigate.
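For reference, the CUDA runtime failures quoted in this thread correspond to well-defined error codes. A minimal stdlib sketch of a lookup (not part of CompreFace; messages taken from the CUDA Runtime API documentation) that can help when reading such logs:

```python
# Sketch: decode the CUDA runtime error codes seen in this thread.
# Code 100 (cudaErrorNoDevice) is the "no CUDA-capable device is
# detected" failure above; code 209 (cudaErrorNoKernelImageForDevice)
# appears in the later traceback and means the binary ships no kernels
# compiled for this GPU's architecture (the Tesla K80 is Kepler, sm_37).
CUDA_ERRORS = {
    100: "cudaErrorNoDevice: no CUDA-capable device is detected",
    209: ("cudaErrorNoKernelImageForDevice: no kernel image is "
          "available for execution on the device"),
}

def explain_cuda_error(code: int) -> str:
    """Return a human-readable description for a CUDA runtime error code."""
    return CUDA_ERRORS.get(code, f"unknown CUDA error {code}")

print(explain_cuda_error(209))
```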

fabio017 commented 1 year ago

```
ity": "CRITICAL", "message": "MXNetError: Traceback (most recent call last):
  File "../include/mshadow/././././cuda/tensor_gpu-inl.cuh", line 128
Name: Check failed: err == cudaSuccess (209 vs. 0) : MapPlanKernel ErrStr:no kernel image is available for execution on the device",
"request": {"method": "POST", "path": "/find_faces", "filename": "lenna.jpg", "api_key": "", "remoteaddr": "172.25.0.3"},
"logger": "src.services.flask.error_handling", "module": "error_handling",
"traceback": "Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/ml/./src/services/flask_/needs_attached_file.py", line 32, in wrapper
    return f(*args, **kwargs)
  File "/app/ml/./src/_endpoints.py", line 69, in find_faces_post
    faces = detector(
  File "/app/ml/./src/services/facescan/plugins/mixins.py", line 46, in __call__
    faces = self._fetch_faces(img, det_prob_threshold)
  File "/app/ml/./src/services/facescan/plugins/mixins.py", line 53, in _fetch_faces
    boxes = self.find_faces(img, det_prob_threshold)
  File "/app/ml/./src/services/facescan/plugins/insightface/insightface.py", line 86, in find_faces
    results = self._detection_model.get(img, det_thresh=det_prob_threshold)
  File "/usr/local/lib/python3.8/dist-packages/cached_property.py", line 36, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/app/ml/./src/services/facescan/plugins/insightface/insightface.py", line 77, in _detection_model
    model.prepare(ctx_id=self._CTX_ID, nms=self._NMS)
  File "/usr/local/lib/python3.8/dist-packages/insightface/app/face_analysis.py", line 32, in prepare
    self.det_model.prepare(ctx_id, nms)
  File "/usr/local/lib/python3.8/dist-packages/insightface/model_zoo/face_detection.py", line 223, in prepare
    out = model.get_outputs()[0].asnumpy()
  File "/usr/local/lib/python3.8/dist-packages/mxnet/ndarray/ndarray.py", line 2568, in asnumpy
    check_call(_LIB.MXNDArraySyncCopyToCPU(
  File "/usr/local/lib/python3.8/dist-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../include/mshadow/././././cuda/tensor_gpu-inl.cuh", line 128
Name: Check failed: err == cudaSuccess (209 vs. 0) : MapPlanKernel ErrStr:no kernel image is available for execution on the device
", "build_version": "dev"}
```
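The decisive detail in this traceback is `(209 vs. 0)`: MXNet logs the raw CUDA error number in its `Check failed` lines. A small helper (a sketch, not part of CompreFace) to pull that code out of such a log line:

```python
import re

def cuda_code_from_log(line: str):
    """Extract the CUDA error code from an MXNet 'Check failed' log line,
    e.g. 'Check failed: err == cudaSuccess (209 vs. 0) : MapPlanKernel'.
    Returns None if the line carries no such code."""
    m = re.search(r"\((\d+) vs\. 0\)", line)
    return int(m.group(1)) if m else None

line = "Name: Check failed: err == cudaSuccess (209 vs. 0) : MapPlanKernel"
print(cuda_code_from_log(line))  # → 209
```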

fabio017 commented 1 year ago

```
Mon Jan  9 22:23:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:03:00.0 Off |                    0 |
| N/A   45C    P0    57W / 149W |   5384MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:04:00.0 Off |                  N/A |
| N/A   39C    P0    72W / 149W |   2316MiB / 11441MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:09:00.0 Off |                    0 |
| N/A   59C    P0    57W / 149W |   3386MiB / 11441MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   45C    P0    72W / 149W |   2811MiB / 11441MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:03:00.0
WARNING: infoROM is corrupted at gpu 0000:04:00.0
```
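The memory columns above can be summarized mechanically. A stdlib sketch (a hypothetical helper, assuming the default `nvidia-smi` table format) that parses the `used / total` MiB pairs out of the output:

```python
import re

def memory_usage(smi_text: str):
    """Yield (used_mib, total_mib) pairs from nvidia-smi table output."""
    for used, total in re.findall(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_text):
        yield int(used), int(total)

sample = "| N/A   45C    P0    57W / 149W |   5384MiB / 11441MiB |      0% |"
print(list(memory_usage(sample)))  # → [(5384, 11441)]
```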

fabio017 commented 1 year ago

```
root@d33f26cf6041:~# curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python
----------Python Info----------
Version      : 3.8.10
Compiler     : GCC 9.4.0
Build        : ('default', 'Jun 22 2022 20:18:18')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 22.2.2
Directory    : /usr/local/lib/python3.8/dist-packages/pip
----------MXNet Info-----------
/usr/local/lib/python3.8/dist-packages/mxnet/numpy/utils.py:37: FutureWarning: In the future `np.bool` will be defined as the corresponding NumPy scalar. (This may have returned Python scalars in past versions.
  bool = onp.bool
An error occured trying to import mxnet.
This is very likely due to missing or incompatible library files.
Traceback (most recent call last):
  File "<stdin>", line 118, in check_mxnet
  File "/usr/local/lib/python3.8/dist-packages/mxnet/__init__.py", line 33, in <module>
    from . import contrib
  File "/usr/local/lib/python3.8/dist-packages/mxnet/contrib/__init__.py", line 30, in <module>
    from . import text
  File "/usr/local/lib/python3.8/dist-packages/mxnet/contrib/text/__init__.py", line 23, in <module>
    from . import embedding
  File "/usr/local/lib/python3.8/dist-packages/mxnet/contrib/text/embedding.py", line 36, in <module>
    from ... import numpy as _mx_np
  File "/usr/local/lib/python3.8/dist-packages/mxnet/numpy/__init__.py", line 23, in <module>
    from .multiarray import *  # pylint: disable=wildcard-import
  File "/usr/local/lib/python3.8/dist-packages/mxnet/numpy/multiarray.py", line 47, in <module>
    from .utils import _get_np_op
  File "/usr/local/lib/python3.8/dist-packages/mxnet/numpy/utils.py", line 37, in <module>
    bool = onp.bool
  File "/usr/local/lib/python3.8/dist-packages/numpy/__init__.py", line 284, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'bool'

----------System Info----------
Platform     : Linux-5.10.0-20-amd64-x86_64-with-glibc2.29
system       : Linux
node         : d33f26cf6041
release      : 5.10.0-20-amd64
version      : #1 SMP Debian 5.10.158-2 (2022-12-13)
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
[23:06:02] ../src/engine/engine.cc:54: MXNet start using engine: ThreadedEnginePerDevice
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          6
On-line CPU(s) list:             0-5
Thread(s) per core:              1
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
Stepping:                        12
CPU MHz:                         4300.265
CPU max MHz:                     4600.0000
CPU min MHz:                     800.0000
BogoMIPS:                        7399.70
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        1.5 MiB
L3 cache:                        9 MiB
NUMA node0 CPU(s):               0-5
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:          Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Vulnerable: No microcode
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0073 sec, LOAD: 0.8678 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0072 sec, LOAD: 0.5998 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1552 sec, LOAD: 1.4356 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0043 sec, LOAD: 0.4316 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0791 sec, LOAD: 0.5201 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.1834 sec, LOAD: 1.0302 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0028 sec, LOAD: 0.5007 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.006358623504638672 sec.
```
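The final `AttributeError` in this output is unrelated to the GPU: NumPy 1.24 removed the long-deprecated `np.bool` alias that this MXNet build still reads at import time. A sketch of the missing guard (using a stand-in namespace so it runs without NumPy installed; the usual workarounds are pinning `numpy<1.24` or aliasing before importing mxnet — both assumptions, not an official CompreFace fix):

```python
import types

# Stand-in for a numpy >= 1.24 module, where the `bool` alias is gone.
onp = types.SimpleNamespace()

# The guard that mxnet's `bool = onp.bool` line lacks: fall back to the
# builtin when the alias has been removed from the numpy namespace.
if not hasattr(onp, "bool"):
    onp.bool = bool

print(onp.bool is bool)  # → True
```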

pospielov commented 1 year ago
  1. Do you run "Single-Docker-File" build or "docker compose" build?
  2. Do other containers that use the GPU run in parallel with CompreFace?