HPDL-Group / Merak


c10::CUDAError happens occasionally #3

Open lin88lin8850 opened 1 year ago

lin88lin8850 commented 1 year ago

```
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: driver shutting down
Exception raised from query at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f3c125c863c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11d (0x7f3c161b911d in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7f3c161bc9b8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x158 (0x7f3c161bed68 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6de4 (0x7f3c69049de4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9609 (0x7f3cab101609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3caaec3293 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: driver shutting down
Exception raised from query at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f00ebf6f63c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11d (0x7f00efb6011d in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7f00efb639b8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x158 (0x7f00efb65d68 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6de4 (0x7f3c69049de4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9609 (0x7f3cab101609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3caaec3293 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: driver shutting down
Exception raised from query at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f00ebf6f63c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11d (0x7f00efb6011d in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7f00efb639b8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x158 (0x7f00efb65d68 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6de4 (0x7f01429f0de4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9609 (0x7f0184aa8609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f018486a293 in /usr/lib/x86_64-linux-gnu/libc.so.6)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 116818) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 187, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 688, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
examples/train_gpt.py FAILED
==================================================
Root Cause:
[0]:
  time: 2022-12-07_05:33:08
  rank: 2 (local_rank: 2)
  exitcode: -6 (pid: 116818)
  error_file: <N/A>
  msg: "Signal 6 (SIGABRT) received by PID 116818"
Other Failures:
[1]:
  time: 2022-12-07_05:33:08
  rank: 3 (local_rank: 3)
  exitcode: -6 (pid: 116819)
  error_file: <N/A>
  msg: "Signal 6 (SIGABRT) received by PID 116819"
==================================================
```

torch version: 1.10, CUDA version: 11.4

Have you ever encountered this problem when using Merak to train GPT-2? Any ideas would be appreciated.

lucasleesw commented 1 year ago

Thanks for using Merak! This looks related to NCCL, and we think it could be a network issue. We encounter this kind of issue rarely, but it does exist.
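
If it shows up again, it may help to turn on NCCL and torch.distributed debug output before launching, so the rank that aborts first reports the underlying communication failure. A minimal sketch of the environment setup (these are standard NCCL/PyTorch environment variables, nothing Merak-specific):

```python
import os

# Verbose NCCL logging: prints transport/ring setup and any communication errors.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Surface asynchronous NCCL failures as exceptions instead of a later abort.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")
# Extra collective-level checks and logging from torch.distributed (recent PyTorch releases).
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

# Set these before the process group is created, e.g. at the very top of
# examples/train_gpt.py or exported in the launch script.
```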

lin88lin8850 commented 1 year ago

Thanks for the quick reply!
I will leave this issue open in case the problem gets figured out in the future, since it happens quite frequently here. I have also tried running the program on another A100 machine, and the problem still exists.

jaywonchung commented 1 year ago

I was mostly able to eliminate this error output by sleeping for three seconds before terminating training. My guess is that these errors come from CUDA/NCCL resources not being perfectly cleaned up at some layer of the code.
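
For reference, a minimal sketch of that workaround, assuming a typical `torch.distributed` training script (the `train()` function below is a placeholder for the actual entry point):

```python
import time

import torch.distributed as dist


def train():
    """Placeholder for the actual Merak/GPT-2 training loop."""
    pass


def main():
    train()

    # Make sure every rank has finished its collectives, then tear down the
    # process group explicitly instead of relying on interpreter shutdown.
    if dist.is_initialized():
        dist.barrier()
        dist.destroy_process_group()

    # Give any outstanding CUDA/NCCL cleanup a moment to finish before the
    # process exits and the driver begins shutting down.
    time.sleep(3)


if __name__ == "__main__":
    main()
```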