cavalleria / cavaface

face recognition training project(pytorch)
MIT License
459 stars 87 forks

RuntimeError: CUDA error: unspecified launch failure #70

Closed Nakupenda-7 closed 3 years ago

Nakupenda-7 commented 3 years ago

Training MobileFaceNet ran without any problem, but training IR-SE fails with the error below. How can I fix it?

Traceback (most recent call last):
  File "train.py", line 367, in <module>
    main()
  File "train.py", line 44, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, cfg, val_dataset))
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: unspecified launch failure (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fd12bbbb193 in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17f66 (0x7fd12bdf8f66 in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x19cbd (0x7fd12bdfacbd in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7fd12bbab63d in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x48f5cb (0x7fd12cc3d5cb in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #5: c10::TensorImpl::release_resources() + 0x20 (0x7fd12bbab610 in /usr/local/lib/python3.5/dist-packages/torch/lib/libc10.so)
frame #6: <unknown function> + 0x67aba2 (0x7fd12ce28ba2 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x67ac46 (0x7fd12ce28c46 in /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch_python.so)
frame #8: /usr/bin/python() [0x56d0f8]
frame #9: /usr/bin/python() [0x586901]
frame #10: /usr/bin/python() [0x5e6aa2]
frame #11: /usr/bin/python() [0x56e9c9]
frame #12: /usr/bin/python() [0x51070d]
frame #13: /usr/bin/python() [0x609b8d]
frame #14: PyGC_Collect + 0x1e (0x609bee in /usr/bin/python)
frame #15: Py_Finalize + 0x59 (0x624929 in /usr/bin/python)
frame #16: Py_Exit + 0x8 (0x624a28 in /usr/bin/python)
frame #17: /usr/bin/python() [0x624b1a]
frame #18: PyErr_PrintEx + 0x36 (0x624b86 in /usr/bin/python)
frame #19: PyRun_SimpleStringFlags + 0x67 (0x6257d7 in /usr/bin/python)
frame #20: Py_Main + 0x581 (0x63efe1 in /usr/bin/python)
frame #21: main + 0xe1 (0x4d13f1 in /usr/bin/python)
frame #22: __libc_start_main + 0xf0 (0x7fd1312a4840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x29 (0x5d62d9 in /usr/bin/python)

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/tensorflow-facenet/cavaface.pytorch/train.py", line 295, in main_worker
    scaled_loss.backward()
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/local/lib/python3.5/dist-packages/apex/amp/_process_optimizer.py", line 190, in post_backward_with_master_weights
    models_are_masters=False)
  File "/usr/local/lib/python3.5/dist-packages/apex/amp/scaler.py", line 119, in unscale
    self.unscale_python(model_grads, master_grads, scale)
  File "/usr/local/lib/python3.5/dist-packages/apex/amp/scaler.py", line 89, in unscale_python
    self.dynamic)
  File "/usr/local/lib/python3.5/dist-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: unspecified launch failure
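
A possible way to narrow this down (a debugging sketch, not a confirmed fix): CUDA kernels are launched asynchronously, so the `float(model_grad.float().sum())` line in the traceback may only be the point where the host synchronized with the GPU and the earlier failure surfaced, not the operation that actually failed. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable before CUDA is initialized makes launches synchronous, so the traceback should point closer to the failing op. The helper name `enable_sync_cuda_debug` is hypothetical, and calling it at the top of `train.py` before `mp.spawn` is an assumption about where it would fit.

```python
import os


def enable_sync_cuda_debug(env=None):
    """Request synchronous CUDA kernel launches for debugging.

    Hypothetical helper: it must run before the process (or any worker
    it spawns) initializes CUDA, otherwise the setting has no effect.
    Training will be slower while this is enabled.
    """
    if env is None:
        # Spawned worker processes inherit the parent's environment.
        env = os.environ
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Assumed usage: call once at the very top of the training script,
# before any CUDA work and before spawning workers.
enable_sync_cuda_debug()
```

The same effect can be had without touching the code by launching the script as `CUDA_LAUNCH_BLOCKING=1 python train.py`.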