facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

How can I train the MusicGen model with Distributed Model Parallel? #273

Closed: rohandubey closed this issue 11 months ago

rohandubey commented 11 months ago

This involves partitioning the model into segments and distributing those segments across the available CUDA devices. Each device computes the forward and backward passes for its segment of the model. This way I can distribute the model across GPUs and train it sequentially across devices. Any guidance or help?
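
The idea described above, as a minimal generic PyTorch sketch (not audiocraft-specific; the layer sizes and device ids are placeholders):

import torch
from torch import nn

class TwoSegmentNet(nn.Module):
    """Naive model parallelism: each segment lives on its own GPU."""
    def __init__(self):
        super().__init__()
        self.seg1 = nn.Linear(1024, 1024).to("cuda:0")
        self.seg2 = nn.Linear(1024, 1024).to("cuda:1")

    def forward(self, x):
        # activations are moved from device to device between segments
        x = self.seg1(x.to("cuda:0"))
        return self.seg2(x.to("cuda:1"))

net = TwoSegmentNet()
out = net(torch.randn(8, 1024))  # requires at least 2 CUDA devices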

Thank you for your time!

adefossez commented 11 months ago

On a single node, just run dora run -d [other options, see training docs] fsdp.use=true autocast=false. On multiple nodes, you would have to use SLURM. At the moment we don't officially support multi-node training without a SLURM cluster.
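
For concreteness, a single-node multi-GPU launch could look like the line below; the solver and dataset names are the example configs from the docs, not a prescription, so substitute your own:

dora run -d solver=musicgen/musicgen_melody_32khz dset=audio/example fsdp.use=true autocast=false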

rohandubey commented 11 months ago

Thank you for the solution! I need to generate a dataset for MusicGen-melody. Should I create the manifest file using AudioDataset, like this?

python -m audiocraft.data.audio_dataset dataset/example egs/example/data.jsonl

If not, which module should I use? And what will the .jsonl file look like in the output?

For the one generated by AudioDataset, the output in egs/example/data.jsonl looks like:

{"path": "dataset/example/electro_1.mp3", "duration": 15.024, "sample_rate": 48000, "amplitude": null, "weight": null, "info_path": null}
{"path": "dataset/example/electro_2.mp3", "duration": 20.035918367346937, "sample_rate": 44100, "amplitude": null, "weight": null, "info_path": null}

Thank you for your time!

adefossez commented 11 months ago

Yes, please follow the instructions there for creating the manifest: https://github.com/facebookresearch/audiocraft/blob/main/docs/DATASETS.md#creating-manifest-files

You will also need to create a datasource (a yaml file that contains pointers to all the necessary manifest files for train, valid, and eval): https://github.com/facebookresearch/audiocraft/blob/main/docs/DATASETS.md#example
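
For reference, the datasource yaml from the linked example has roughly this shape (the egs/example paths are placeholders; point them at your own manifest folders):

# @package __global__
datasource:
  max_sample_rate: 44100
  max_channels: 2

  train: egs/example
  valid: egs/example
  evaluate: egs/example
  generate: egs/example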

rohandubey commented 11 months ago

Understood, but additional metadata such as key, instruments, and sample rate is not getting stored in the manifest files created. To my knowledge, the music description also needs to be passed while fine-tuning the melody model.

adefossez commented 11 months ago

Additional metadata can be provided in a .json file placed next to the audio file (same filename, just a different extension). This class defines the possible entries in the json file: https://github.com/facebookresearch/audiocraft/blob/main/audiocraft/data/music_dataset.py#L37

You are responsible for generating this file by your own means! Note that at the moment only the description entry is actually used.
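
A minimal sketch of generating such sidecar files in Python, assuming only the description entry is needed (the helper name and the sample description are made up for illustration):

import json
from pathlib import Path

def write_sidecar(audio_path: str, description: str) -> None:
    # e.g. writes dataset/example/electro_1.json next to dataset/example/electro_1.mp3
    Path(audio_path).with_suffix(".json").write_text(json.dumps({"description": description}))

write_sidecar("dataset/example/electro_1.mp3", "energetic electro track with a driving bassline")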

rohandubey commented 11 months ago

So, for the melody model too, just the description entry is used? So the manifest will look something like this:

{"path": "path", "duration": 15, "sample_rate": 48000, "amplitude": null, "weight": null, "info_path": null, "description": "sample description"}

adefossez commented 11 months ago

For the melody, the original wav is directly processed.

This computation can be quite slow, so it is advised to activate the cache for it; see https://github.com/facebookresearch/audiocraft/blob/main/docs/CONDITIONING.md#faster-computation-of-conditions

We do not have a dedicated script for populating the cache; instead, we recommend launching a dummy training with a super tiny model to quickly go through the dataset.
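
As a hypothetical example, such a dummy pass could reuse the tiny debug solver shipped with the repo, if it matches your data config (the exact cache override keys are in the CONDITIONING.md page linked above):

dora run -d solver=musicgen/debug dset=audio/example [cache options, see conditioning docs]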

adefossez commented 11 months ago

If you need to train on multiple nodes without SLURM, see these instructions: https://github.com/facebookresearch/dora#multi-node-training-without-slurm

rohandubey commented 11 months ago

Hi @adefossez, what is the memory requirement for fine-tuning the melody model? I am running on an A100 40GB with a batch size of 1 but still getting out-of-memory errors. Any help?

rohandubey commented 11 months ago

Thanks for your help! I am closing this issue.

X-Drunker commented 10 months ago

On a single node, just run dora run -d [other options, see training docs] fsdp.use=true autocast=false. On multiple nodes, you would have to use SLURM. At the moment we don't officially support multi-node training without a SLURM cluster.

Hello, I'm trying to train MusicGen on 1 node with 8 V100 GPUs. I ran this command and got the error below:

Error executing job with overrides: ['solver=musicgen/musicgen_base_32khz.yaml', 'compression_model_checkpoint=//pretrained/facebook/encodec_32khz', 'transformer_lm.n_q=4', 'transformer_lm.card=2048', 'fsdp.use=true', 'autocast=false']
Traceback (most recent call last):
  File "/scratch/amlt_code/audiocraft/train.py", line 133, in main
    solver = get_solver(cfg)
  File "/scratch/amlt_code/audiocraft/train.py", line 47, in get_solver
    solver = solvers.get_solver(cfg)
  File "/scratch/amlt_code/audiocraft/solvers/builders.py", line 56, in get_solver
    return klass(cfg)  # type: ignore
  File "/scratch/amlt_code/audiocraft/solvers/musicgen.py", line 38, in __init__
    super().__init__(cfg)
  File "/scratch/amlt_code/audiocraft/solvers/base.py", line 79, in __init__
    self.build_model()
  File "/scratch/amlt_code/audiocraft/solvers/musicgen.py", line 140, in build_model
    self.model = self.wrap_with_fsdp(self.model)
  File "/scratch/amlt_code/audiocraft/solvers/base.py", line 142, in wrap_with_fsdp
    model = fsdp.wrap_with_fsdp(self.cfg.fsdp, model, *args, **kwargs)
  File "/scratch/amlt_code/audiocraft/optim/fsdp.py", line 99, in wrap_with_fsdp
    wrapped = _FSDPFixStateDict(
  File "/home/aiscuser/.local/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 360, in __init__
    _init_process_group_state(
  File "/home/aiscuser/.local/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 117, in _init_process_group_state
    process_group if process_group is not None else _get_default_group()
  File "/home/aiscuser/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 707, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

When I simply ran dora run -d [other options, see training docs] without the parameters fsdp.use=true autocast=false, the training process launched, but unfortunately only one GPU was working. The logs are similar to this comment, and I got this warning:

/home/aiscuser/.local/lib/python3.8/site-packages/torch/cuda/__init__.py:762: UserWarning: Synchronization debug mode is a prototype feature and does not yet detect all synchronizing operations (Triggered internally at ../torch/csrc/cuda/Module.cpp:830.)
  torch._C._cuda_set_sync_debug_mode(debug_mode)
/scratch/amlt_code/audiocraft/solvers/musicgen.py:220: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_targets = targets_k[mask_k]
/scratch/amlt_code/audiocraft/solvers/musicgen.py:221: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:148.)
  ce_logits = logits_k[mask_k]
[10-13 02:15:46][flashy.solver][INFO] - Train | Epoch 1 | 200/2000 | 1.64 it/sec | lr 2.50E-02 | grad_norm 1.816E+01 | grad_scale 65536.000 | ce 7.970 | ppl 2933.386
...

So, are the parameters fsdp.use=true autocast=false necessary for multi-GPU training? If so, how should I solve the problem? I really appreciate your reply.