facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

No training with multiple nodes using torchrun #369

Closed ElizavetaSedova closed 6 months ago

ElizavetaSedova commented 6 months ago

I run this on 2 nodes:

torchrun --master-addr [] \
--master-port [] \
--node_rank [] \
--nnodes [] \
--nproc-per-node [] \
-m dora run \
solver=musicgen/musicgen_base_32khz \
model/lm/model_scale=small  \
conditioner=text2music \
dset=audio/example \
dataset.batch_size=12 \
continue_from=//pretrained/facebook/musicgen-small

The process starts successfully on both nodes. However, after that I see no logs for a long time and suspect the iterations never start. The process simply hangs, even though the GPUs show load. The only output is these warnings:

UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_data_ptr = tensors[0].storage().data_ptr()
/home/user/.conda/envs/musicgen/lib/python3.9/site-packages/xformers/ops/unbind.py:46: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage_data_ptr = tensors[0].storage().data_ptr()
/home/user/.conda/envs/musicgen/lib/python3.9/site-packages/torch/cuda/__init__.py:762: UserWarning: Synchronization debug mode is a prototype feature and does not yet detect all synchronizing operations (Triggered internally at ../torch/csrc/cuda/Module.cpp:830.)
  torch._C._cuda_set_sync_debug_mode(debug_mode)
/home/user/.conda/envs/musicgen/lib/python3.9/site-packages/torch/cuda/__init__.py:762: UserWarning: Synchronization debug mode is a prototype feature and does not yet detect all synchronizing operations (Triggered internally at ../torch/csrc/cuda/Module.cpp:830.)
  torch._C._cuda_set_sync_debug_mode(debug_mode)
/mnt/nvme_raid0/user/audiocraft/audiocraft/solvers/musicgen.py:221: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_targets = targets_k[mask_k]
/mnt/nvme_raid0/user/audiocraft/audiocraft/solvers/musicgen.py:222: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_logits = logits_k[mask_k]
/mnt/nvme_raid0/user/audiocraft/audiocraft/solvers/musicgen.py:221: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_targets = targets_k[mask_k]
/mnt/nvme_raid0/user/audiocraft/audiocraft/solvers/musicgen.py:222: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_logits = logits_k[mask_k]
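Not part of the original report, but a common first step when a multi-node NCCL job hangs at startup like this is to raise the NCCL and torch.distributed log levels before launching. These are standard PyTorch/NCCL environment variables; the interface name is an assumption and must be replaced with the actual inter-node NIC:

```shell
# Standard diagnostics for a multi-node NCCL hang (not audiocraft-specific).
export NCCL_DEBUG=INFO               # print NCCL init and transport-selection logs
export NCCL_DEBUG_SUBSYS=INIT,NET    # focus the logs on rendezvous and networking
# Assumption: eth0 is the interface the nodes use to reach each other;
# a wrong NCCL_SOCKET_IFNAME is itself a frequent cause of exactly this hang.
export NCCL_SOCKET_IFNAME=eth0
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra collective-mismatch checks in torch
```

With `NCCL_DEBUG=INFO` set on both nodes, the logs usually show whether the ranks ever complete NCCL initialization or which transport they are stuck negotiating.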

When I run the same command on one node, iterations start after a short time and everything works. I also always see these warnings, but I don't know how to fix them. The main problem, though, is starting training on several nodes without Slurm.
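To separate audiocraft/dora from the rendezvous itself, a minimal `torch.distributed` smoke test (my sketch, not from this report) can be launched on both nodes with the same `torchrun` flags. If this also hangs at the `all_reduce`, the problem is the network/rendezvous setup rather than the training code. It falls back to a single-process group when run directly, so it can be sanity-checked locally first:

```python
import os
import torch
import torch.distributed as dist

def smoke_test():
    # When launched via torchrun, MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE
    # are already set; these defaults only kick in for a direct local run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    # gloo works without GPUs; swap in "nccl" to exercise the GPU path.
    dist.init_process_group(backend="gloo")
    t = torch.ones(1)
    dist.all_reduce(t)  # a multi-node hang typically blocks right here
    result = (dist.get_rank(), int(t.item()))  # t sums to WORLD_SIZE
    dist.destroy_process_group()
    return result

if __name__ == "__main__":
    print(smoke_test())
```

Run with `torchrun --nnodes 2 --node_rank <0|1> --master-addr <addr> --master-port <port> --nproc-per-node 1 smoke.py` on each node; every rank should print its rank and the world size within seconds.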