bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

pretrain_gpt_distributed.sh ERROR! #341

Closed cdj0311 closed 2 years ago

cdj0311 commented 2 years ago

Hi, I ran pretrain_gpt_distributed.sh to pretrain GPT-2, but it fails with the error below. What's the problem?

Loading extension module fused_mix_prec_layer_norm_cuda...
[610867610e4b:2498 :0:2599] Caught signal 7 (Bus error: nonexistent physical address)

 0 0x0000000000014420 funlockfile()           ???:0
 1 0x000000000018bb41 nss_database_lookup()   ???:0
 2 0x000000000006929c ncclGroupEnd()          ???:0
 3 0x000000000006b9ae ncclGroupEnd()          ???:0
 4 0x0000000000050853 ncclGetUniqueId()       ???:0
 5 0x00000000000417b4 ???()                   /lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x0000000000042c4d ???()                   /lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x0000000000058b37 ncclRedOpDestroy()      ???:0
 8 0x0000000000008609 start_thread()          ???:0
 9 0x000000000011f133 clone()                 ???:0

Fatal Python error: Bus error

Thread 0x00007f8bf451b740 (most recent call first):
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2792 in barrier
  File "/workspace/Megatron-DeepSpeed/megatron/initialize.py", line 217 in _compile_dependencies
  File "/workspace/Megatron-DeepSpeed/megatron/initialize.py", line 164 in initialize_megatron
  File "/workspace/Megatron-DeepSpeed/megatron/training.py", line 99 in pretrain
  File "pretrain_gpt.py", line 231 in main
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "pretrain_gpt.py", line 235 in <module>

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2497 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 2498) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

pretrain_gpt.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2022-09-02_06:14:05
  host       : 610867610e4b
  rank       : 1 (local_rank: 1)
  exitcode   : -7 (pid: 2498)
  error_file :
  traceback  : Signal 7 (SIGBUS) received by PID 2498
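For reference: a SIGBUS (signal 7) raised inside NCCL while initialize_megatron is waiting on the distributed barrier is usually a shared-memory problem rather than a bug in the training code. When the job runs inside a Docker container (the hostname 610867610e4b looks like a container ID), /dev/shm defaults to 64 MB, which is too small for NCCL's shared-memory transport and produces exactly this "Bus error: nonexistent physical address". A minimal sketch of the usual workaround, assuming a containerized run; the image name and mount path below are placeholders, not taken from the issue:

# Sketch of a possible fix (assumption: training runs inside Docker and /dev/shm is too small).
# Option 1: give the container a larger /dev/shm.
docker run --gpus all --shm-size=16g \
    -v /path/to/Megatron-DeepSpeed:/workspace/Megatron-DeepSpeed \
    my-megatron-image \
    bash pretrain_gpt_distributed.sh

# Option 2: share the host IPC namespace so the container uses the host's /dev/shm directly.
docker run --gpus all --ipc=host my-megatron-image bash pretrain_gpt_distributed.sh

If the run is on bare metal instead, checking that /dev/shm is not full (df -h /dev/shm) and that the installed libnccl matches the CUDA/PyTorch build are reasonable next steps.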