Project-MONAI / tutorials

MONAI Tutorials
https://monai.io/started.html
Apache License 2.0

maisi_diff_unet_training_tutorial.ipynb hit random Segmentation fault on H100 #1858

Open KumoLiu opened 1 month ago

KumoLiu commented 1 month ago
Executing Cell 19--------------------------------------
INFO:notebook:Training the model...

INFO:training:Using cuda:0 of 1
INFO:training:[config] ckpt_folder -> ./temp_work_dir/./models.
INFO:training:[config] data_root -> ./temp_work_dir/./embeddings.
INFO:training:[config] data_list -> ./temp_work_dir/sim_datalist.json.
INFO:training:[config] lr -> 0.0001.
INFO:training:[config] num_epochs -> 2.
INFO:training:[config] num_train_timesteps -> 1000.
INFO:training:num_files_train: 2
INFO:training:Training from scratch.
INFO:training:Scaling factor set to 1.159390926361084.
INFO:training:scale_factor -> 1.159390926361084.
INFO:training:torch.set_float32_matmul_precision -> highest.
INFO:training:Epoch 1, lr 0.0001.
WARNING:py.warnings:cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/cudnn/MHA.cpp:667.)

INFO:training:[2024-09-30 10:58:31] epoch 1, iter 1/2, loss: 0.7987, lr: 0.000100000000.
INFO:training:[2024-09-30 10:58:31] epoch 1, iter 2/2, loss: 0.7931, lr: 0.000056250000.
INFO:training:epoch 1 average loss: 0.7959.
INFO:training:Epoch 2, lr 2.5e-05.
INFO:training:[2024-09-30 10:58:32] epoch 2, iter 1/2, loss: 0.7952, lr: 0.000025000000.
INFO:training:[2024-09-30 10:58:32] epoch 2, iter 2/2, loss: 0.7875, lr: 0.000006250000.
INFO:training:epoch 2 average loss: 0.7913.
[ipp2-0112:385  :0:987] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x64616f74)
==== backtrace (tid:    987) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000459e0 __cxa_finalize()  ???:0
 2 0x000000000030ee76 ???()  /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.12:0
=================================
E0930 11:15:58.328000 140660921045632 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: -11) local_rank: 0 (pid: 385) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0a0+872d972e41.nv24.8.1', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
scripts.diff_model_train FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-30_11:15:58
  host      : ipp2-0112.ipp2u1.colossus.nvidia.com
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 385)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 385
=====================================================
KumoLiu commented 1 month ago

After investigating, I found that setting amp=False does not trigger the issue. I added an argument in https://github.com/Project-MONAI/tutorials/pull/1857 as a workaround for this issue.
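For illustration, here is a minimal sketch of what toggling such an amp flag typically changes in a PyTorch training step. This is not the actual scripts.diff_model_train code; the function name, loss, and arguments are assumptions made only to show that with amp=False the forward/backward pass stays in float32 and the GradScaler path is bypassed, which avoids the mixed-precision SDPA backward path hinted at by the cuDNN warning above.

```python
import torch
import torch.nn as nn

def train_step(model, batch, optimizer, scaler, amp: bool = False, device: str = "cuda:0"):
    """Hypothetical training step showing the effect of an `amp` switch."""
    images, targets = batch
    images, targets = images.to(device), targets.to(device)
    optimizer.zero_grad(set_to_none=True)

    # autocast is disabled entirely when amp=False (the workaround),
    # so the model runs in full float32 precision.
    with torch.autocast(device_type="cuda", enabled=amp):
        loss = nn.functional.mse_loss(model(images), targets)

    if amp:
        # Mixed-precision path: scale the loss before backward.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        # Plain float32 path used by the workaround.
        loss.backward()
        optimizer.step()
    return loss.item()
```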

cc @dongyang0122