facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

No training with multiple nodes using torchrun #369

Closed ElizavetaSedova closed 6 months ago

ElizavetaSedova commented 6 months ago

I run this on 2 nodes:

torchrun --master-addr [] \
--master-port [] \
--node_rank [] \
--nnodes [] \
--nproc-per-node [] \
-m dora run \
solver=musicgen/musicgen_base_32khz \
model/lm/model_scale=small  \
conditioner=text2music \
dset=audio/example \
dataset.batch_size=12 \
continue_from=//pretrained/facebook/musicgen-small

The process starts successfully on both nodes. However, after that I see no logs for a long time and suspect the iterations never start. The process simply hangs, even though the GPUs show load. The only output is these warnings:

UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_data_ptr = tensors[0].storage().data_ptr()
/home/user/.conda/envs/musicgen/lib/python3.9/site-packages/xformers/ops/unbind.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_data_ptr = tensors[0].storage().data_ptr()
/home/user/.conda/envs/musicgen/lib/python3.9/site-packages/torch/cuda/__init__.py:762: UserWarning: Synchronization debug mode is a prototype feature and does not yet detect all synchronizing operations (Triggered internally at ../torch/csrc/cuda/Module.cpp:830.)
  torch._C._cuda_set_sync_debug_mode(debug_mode)
/home/user/.conda/envs/musicgen/lib/python3.9/site-packages/torch/cuda/__init__.py:762: UserWarning: Synchronization debug mode is a prototype feature and does not yet detect all synchronizing operations (Triggered internally at ../torch/csrc/cuda/Module.cpp:830.)
  torch._C._cuda_set_sync_debug_mode(debug_mode)
/mnt/nvme_raid0/user/audiocraft/audiocraft/solvers/musicgen.py:221: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_targets = targets_k[mask_k]
/mnt/nvme_raid0/user/audiocraft/audiocraft/solvers/musicgen.py:222: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_logits = logits_k[mask_k]
/mnt/nvme_raid0/user/audiocraft/audiocraft/solvers/musicgen.py:221: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_targets = targets_k[mask_k]
/mnt/nvme_raid0/user/audiocraft/audiocraft/solvers/musicgen.py:222: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_logits = logits_k[mask_k]
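Not part of the original report, but a common first step when a multi-node NCCL job hangs at startup like this is to raise the NCCL and torch.distributed log levels before launching. These are standard PyTorch/NCCL environment variables; the interface name is an assumption and must be replaced with the actual inter-node NIC:

```shell
# Standard diagnostics for a multi-node NCCL hang (not audiocraft-specific).
export NCCL_DEBUG=INFO               # print NCCL init and transport-selection logs
export NCCL_DEBUG_SUBSYS=INIT,NET    # focus the logs on rendezvous and networking
# Assumption: eth0 is the interface the nodes use to reach each other;
# a wrong NCCL_SOCKET_IFNAME is itself a frequent cause of exactly this hang.
export NCCL_SOCKET_IFNAME=eth0
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra collective-mismatch checks in torch
```

With `NCCL_DEBUG=INFO` set on both nodes, the logs usually show whether the ranks ever complete NCCL initialization or which transport they are stuck negotiating.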

When I run the same command on one node, iterations start after a short time and everything works. I also always see these warnings, but I don't know how to fix them. The main problem, though, is starting training on several nodes without Slurm.
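To separate audiocraft/dora from the rendezvous itself, a minimal `torch.distributed` smoke test (my sketch, not from this report) can be launched on both nodes with the same `torchrun` flags. If this also hangs at the `all_reduce`, the problem is the network/rendezvous setup rather than the training code. It falls back to a single-process group when run directly, so it can be sanity-checked locally first:

```python
import os
import torch
import torch.distributed as dist

def smoke_test():
    # When launched via torchrun, MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE
    # are already set; these defaults only kick in for a direct local run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    # gloo works without GPUs; swap in "nccl" to exercise the GPU path.
    dist.init_process_group(backend="gloo")
    t = torch.ones(1)
    dist.all_reduce(t)  # a multi-node hang typically blocks right here
    result = (dist.get_rank(), int(t.item()))  # t sums to WORLD_SIZE
    dist.destroy_process_group()
    return result

if __name__ == "__main__":
    print(smoke_test())
```

Run with `torchrun --nnodes 2 --node_rank <0|1> --master-addr <addr> --master-port <port> --nproc-per-node 1 smoke.py` on each node; every rank should print its rank and the world size within seconds.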