DingXiaoH / RepOptimizers

Official repo of RepOptimizers and RepOpt-VGG
MIT License

Training error #2

Closed · huoshuai-dot closed this 1 year ago

huoshuai-dot commented 1 year ago

I load the data through the torchvision.dataset.imagenet interface, but training fails with the following error:

```
Traceback (most recent call last):
  File "main_repopt.py", line 461, in <module>
    main(config)
  File "main_repopt.py", line 199, in main
    train_one_epoch(config, model, criterion, data_loader_train, optimizer, epoch, mixup_fn, lr_scheduler, model_ema=model_ema)
  File "main_repopt.py", line 298, in train_one_epoch
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 264, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 153, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
terminate called after throwing an instance of 'c10::Error'
  what():  NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:161, unhandled cuda error, NCCL version 21.1.4
ncclUnhandledCudaError: Call to CUDA function failed.
Exception raised from ncclCommAbort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:161 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f37de87663c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f37de841a28 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x3c1e92e (0x7f361ae5892e in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0xac (0x7f361ae393fc in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0xd (0x7f361ae395cd in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x10f3211 (0x7f366cc99211 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1105810 (0x7f366ccab810 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xa71082 (0x7f366c617082 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa72043 (0x7f366c618043 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xf8b98 (0x56188c6e6b98 in /opt/conda/bin/python3)
frame #10: <unknown function> + 0xfa78b (0x56188c6e878b in /opt/conda/bin/python3)
frame #11: <unknown function> + 0xf8b4f (0x56188c6e6b4f in /opt/conda/bin/python3)
frame #12: <unknown function> + 0x1ef516 (0x56188c7dd516 in /opt/conda/bin/python3)
frame #13: <unknown function> + 0x11c574 (0x56188c70a574 in /opt/conda/bin/python3)
frame #14: _PyGC_CollectNoFail + 0x2b (0x56188c8435db in /opt/conda/bin/python3)
frame #15: PyImport_Cleanup + 0x371 (0x56188c85d7b1 in /opt/conda/bin/python3)
frame #16: Py_FinalizeEx + 0x7a (0x56188c85da9a in /opt/conda/bin/python3)
frame #17: Py_RunMain + 0x1b8 (0x56188c8625c8 in /opt/conda/bin/python3)
frame #18: Py_BytesMain + 0x39 (0x56188c862939 in /opt/conda/bin/python3)
frame #19: __libc_start_main + 0xf3 (0x7f37f39470b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x1e8f39 (0x56188c7d6f39 in /opt/conda/bin/python3)
```
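For context, the data pipeline is just the standard torchvision ImageNet dataset fed into a DataLoader; a minimal sketch of that setup is below (the root path, transform, and batch size are illustrative placeholders, not the exact values from the training config):

```python
import torch
from torchvision import datasets, transforms

# Illustrative placeholders only -- path, transform, and batch size are not
# the values actually used in the failing run.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# torchvision.datasets.ImageNet expects the extracted ImageNet layout under `root`.
dataset_train = datasets.ImageNet(root='/path/to/imagenet', split='train',
                                  transform=transform)

# The real distributed run would use a DistributedSampler here; it is omitted
# so the sketch also runs without an initialized process group.
data_loader_train = torch.utils.data.DataLoader(
    dataset_train, batch_size=32, shuffle=True, num_workers=8, pin_memory=True)
```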

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 8744) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 187, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 688, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
************************************************
             main_repopt.py FAILED
================================================
Root Cause:
[0]:
  time: 2022-07-20_08:16:08
  rank: 0 (local_rank: 0)
  exitcode: -6 (pid: 8744)
  error_file: <N/A>
  msg: "Signal 6 (SIGABRT) received by PID 8744"
================================================
Other Failures:
  <NO_OTHER_FAILURES>
************************************************
```

I have no idea what is causing this error. Could anyone help me figure it out?
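Since CUDA errors are reported asynchronously, the line in the traceback (`loss.backward()`) is not necessarily where the failing `cublasSgemm` call actually happened. Below is a minimal sketch of a common way to narrow this down; `CUDA_LAUNCH_BLOCKING` is a standard CUDA/PyTorch environment variable, while `check_device_sanity` is a hypothetical helper for illustration, not part of this repo:

```python
import os

# Force synchronous kernel launches so the Python traceback points at the op
# that actually failed. Must be set before the first CUDA call (e.g. at the
# very top of main_repopt.py, or exported in the shell before launching).
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported only after the environment variable is set


def check_device_sanity(local_rank: int) -> None:
    """Hypothetical helper: run a tiny float32 matmul on the target GPU.

    If even this fails with CUBLAS_STATUS_EXECUTION_FAILED, the problem is
    likely the environment (driver / CUDA / cuBLAS mismatch, or a faulty GPU)
    rather than anything in the training code.
    """
    torch.cuda.set_device(local_rank)
    a = torch.randn(1024, 1024, device='cuda')
    b = torch.randn(1024, 1024, device='cuda')
    c = a @ b                   # float32 matmul goes through cublasSgemm
    torch.cuda.synchronize()    # surface any asynchronous CUDA error here
    print(f'rank {local_rank}: matmul OK, mean={c.mean().item():.4f}')


if __name__ == '__main__':
    # torch.distributed.launch / torchrun export LOCAL_RANK for each worker;
    # fall back to GPU 0 when run as a plain single-process script.
    check_device_sanity(int(os.environ.get('LOCAL_RANK', 0)))
```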