Running Cell 19 of the tutorial notebook finishes both training epochs, but the `torchrun`-launched worker then crashes with a segmentation fault (SIGSEGV inside `libnvrtc.so.12` during `__cxa_finalize`) while the process is shutting down:

```
Executing Cell 19--------------------------------------
INFO:notebook:Training the model...
INFO:training:Using cuda:0 of 1
INFO:training:[config] ckpt_folder -> ./temp_work_dir/./models.
INFO:training:[config] data_root -> ./temp_work_dir/./embeddings.
INFO:training:[config] data_list -> ./temp_work_dir/sim_datalist.json.
INFO:training:[config] lr -> 0.0001.
INFO:training:[config] num_epochs -> 2.
INFO:training:[config] num_train_timesteps -> 1000.
INFO:training:num_files_train: 2
INFO:training:Training from scratch.
INFO:training:Scaling factor set to 1.159390926361084.
INFO:training:scale_factor -> 1.159390926361084.
INFO:training:torch.set_float32_matmul_precision -> highest.
INFO:training:Epoch 1, lr 0.0001.
WARNING:py.warnings:cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/cudnn/MHA.cpp:667.)
INFO:training:[2024-09-30 10:58:31] epoch 1, iter 1/2, loss: 0.7987, lr: 0.000100000000.
INFO:training:[2024-09-30 10:58:31] epoch 1, iter 2/2, loss: 0.7931, lr: 0.000056250000.
INFO:training:epoch 1 average loss: 0.7959.
INFO:training:Epoch 2, lr 2.5e-05.
INFO:training:[2024-09-30 10:58:32] epoch 2, iter 1/2, loss: 0.7952, lr: 0.000025000000.
INFO:training:[2024-09-30 10:58:32] epoch 2, iter 2/2, loss: 0.7875, lr: 0.000006250000.
INFO:training:epoch 2 average loss: 0.7913.
[ipp2-0112:385 :0:987] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x64616f74)
==== backtrace (tid: 987) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000459e0 __cxa_finalize()  ???:0
 2 0x000000000030ee76 ???()  /usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.12:0
=================================
E0930 11:15:58.328000 140660921045632 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: -11) local_rank: 0 (pid: 385) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0a0+872d972e41.nv24.8.1', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
scripts.diff_model_train FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-30_11:15:58
  host      : ipp2-0112.ipp2u1.colossus.nvidia.com
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 385)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 385
=====================================================
```
After investigating, I found that setting `amp=False` avoids the issue. I added an argument in https://github.com/Project-MONAI/tutorials/pull/1857 as a workaround.
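For reference, below is a minimal sketch of how an `amp` flag can gate mixed precision in a PyTorch training step, so it can be switched off from the command line. The `--amp` argument name, the placeholder model, and the loop body are illustrative assumptions, not the actual `scripts.diff_model_train` code or the exact argument added in the PR.

```python
# Sketch only: gate autocast/GradScaler behind an `amp` flag (assumed names).
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--amp",
    type=lambda s: s.lower() in ("1", "true", "yes"),
    default=True,
    help="pass 'false' to disable mixed precision as a workaround",
)
args = parser.parse_args()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 8).to(device)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# A disabled GradScaler passes gradients through unscaled.
scaler = torch.amp.GradScaler(device.type, enabled=args.amp)

x = torch.randn(4, 8, device=device)
target = torch.randn(4, 8, device=device)

optimizer.zero_grad(set_to_none=True)
# autocast(enabled=False) is a no-op, so the same loop runs in full precision.
with torch.autocast(device_type=device.type, enabled=args.amp):
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

With this pattern, launching the script with `--amp false` runs the identical loop in full precision, since both `autocast` and `GradScaler` become no-ops when disabled.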
cc @dongyang0122