facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Do I need a Slurm cluster to train MusicGen with Dora? #342

Open piyuch opened 8 months ago

piyuch commented 8 months ago
Traceback (most recent call last): 
  File "/opt/conda/bin/dora", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/dora/__main__.py", line 170, in main
    args.action(args, main)
  File "/opt/conda/lib/python3.10/site-packages/dora/grid.py", line 138, in grid_action
    run_grid(main, explorer, args.grid, rules, slurm, grid_args)
  File "/opt/conda/lib/python3.10/site-packages/dora/grid.py", line 265, in run_grid
    shepherd.commit()
  File "/opt/conda/lib/python3.10/site-packages/dora/shep.py", line 242, in commit
    self._submit(job_array)
  File "/opt/conda/lib/python3.10/site-packages/dora/shep.py", line 359, in _submit
    executor = self._get_submitit_executor(name, submitit_folder, slurm_config)
  File "/opt/conda/lib/python3.10/site-packages/dora/shep.py", line 265, in _get_submitit_executor
    executor = submitit.SlurmExecutor(
  File "/opt/conda/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 249, in __init__
    raise RuntimeError('Could not detect "srun", are you indeed on a slurm cluster?')
RuntimeError: Could not detect "srun", are you indeed on a slurm cluster?

Is there a way to run MusicGen training with Dora without setting up a Slurm cluster?

adefossez commented 8 months ago

You can have a look at this to run the jobs manually without Slurm: https://github.com/facebookresearch/dora#multi-node-training-without-slurm

You won't be able to use the dora grid command to launch jobs, although running it as dora grid GRID_NAME --dry_run --init will still let you use each signature with the -f flag in dora run commands, e.g.

torchrun [...] -m dora run -f SIG

Or, for single-machine training, you can bypass torchrun entirely and just use the -d flag.
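
The two launch styles described above could be sketched as a small Python helper that only assembles the command lines and prints them (it does not execute anything). The function names are hypothetical, GRID_NAME and SIG stay placeholders exactly as in the comment, and combining -d with -f in one call is an assumption to check against the Dora README.

```python
import shlex
from typing import List, Optional


def grid_init_command(grid_name: str) -> List[str]:
    """Command that initializes a grid's XPs without scheduling anything."""
    return ["dora", "grid", grid_name, "--dry_run", "--init"]


def dora_run_command(overrides: List[str],
                     sig: Optional[str] = None,
                     distributed: bool = True) -> List[str]:
    """Assemble a single-machine `dora run` command (no Slurm, no torchrun)."""
    cmd = ["dora", "run"]
    if distributed:
        cmd.append("-d")       # -d: flag mentioned in the comment above
    if sig is not None:
        cmd += ["-f", sig]     # -f SIG: reuse the signature from the dry run
    cmd += overrides           # any extra config overrides, passed through as-is
    return cmd


if __name__ == "__main__":
    # Step 1: initialize the grid so that signatures exist.
    print(shlex.join(grid_init_command("GRID_NAME")))
    # Step 2: launch one XP locally from its signature.
    print(shlex.join(dora_run_command([], sig="SIG")))
```

Printing the commands rather than running them keeps the sketch usable on a machine where Dora is not installed; the printed lines can be pasted into a shell once the placeholders are filled in.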

jbm-composer commented 3 months ago

I'm trying to run from a single AWS node, using the -d flag (with a solver), but I hit a configuration error:

[...]
  File "/home/james/src/somms/audiocraft/audiocraft/environment.py", line 77, in _get_cluster_config
    return self.config[self.cluster]
  File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 375, in __getitem__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 369, in __getitem__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigKeyError: Missing key aws
    full_key: aws
    object_type=dict

Is there a way around this? I thought the -d flag should ignore cluster info, but it seems like AWS is being detected automatically.
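
For anyone reading the traceback: it reduces to indexing a per-cluster config mapping with a detected cluster name ("aws") that has no section. Below is a minimal plain-Python sketch of that failure mode with one possible guard. The mapping contents, the "default" section name, the path, and the function name are all hypothetical; this is not AudioCraft's actual code or config.

```python
from typing import Any, Dict, Mapping


def get_cluster_config(config: Mapping[str, Dict[str, Any]],
                       cluster: str,
                       fallback: str = "default") -> Dict[str, Any]:
    """Return the section for `cluster`, or the `fallback` section if absent."""
    if cluster in config:
        return config[cluster]
    if fallback in config:
        return config[fallback]
    raise KeyError(f"Missing key {cluster} (and no {fallback!r} section)")


# Hypothetical config that, like the one in the traceback, has no "aws" section.
team_config = {"default": {"dora_dir": "/tmp/dora_outputs"}}

if __name__ == "__main__":
    try:
        team_config["aws"]  # same shape as self.config[self.cluster]
    except KeyError as err:
        print("plain lookup fails:", err)

    # The guarded lookup resolves to the fallback section instead of raising.
    print(get_cluster_config(team_config, "aws"))
```

In practice the equivalent fix would be either adding an aws section to whichever config file is being loaded, or making the detected cluster name resolve to a section that already exists; which of these the project supports should be checked against its documentation.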