piyuch opened this issue 1 year ago
You can have a look at this to run the jobs manually without Slurm: https://github.com/facebookresearch/dora#multi-node-training-without-slurm

You won't be able to use the `dora grid` command, although running the grid with `dora grid GRID_NAME --dry_run --init` will let you use each job's signature with the `-f` flag in the `dora run` commands, e.g. `torchrun [...] -m dora run -f SIG`. Or, for single-machine training, you can bypass `torchrun` entirely and just use the `-d` flag.
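Putting the steps above together as a command sketch (`GRID_NAME` and `SIG` are placeholders, and the exact `torchrun` arguments depend on your setup; whether `-f` is combined with `-d` in the last step is an assumption, not confirmed by this comment):

```shell
# Initialize the grid without scheduling anything, to obtain each job's signature:
dora grid GRID_NAME --dry_run --init

# Multi-node training: launch each job manually by its signature via torchrun:
torchrun [...] -m dora run -f SIG

# Single-machine training: bypass torchrun entirely and use Dora's -d flag:
dora run -d -f SIG
```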
I'm trying to run from a single AWS node, using the `-d` flag (with a solver), but I hit a configuration error:
[...]
File "/home/james/src/somms/audiocraft/audiocraft/environment.py", line 77, in _get_cluster_config
return self.config[self.cluster]
File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 375, in __getitem__
self._format_and_raise(key=key, value=None, cause=e)
File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/base.py", line 231, in _format_and_raise
format_and_raise(
File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
_raise(ex, cause)
File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/_utils.py", line 797, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 369, in __getitem__
return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
node = self._get_child(
File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
child = self._get_node(
File "/home/james/miniconda3/envs/torch_ac/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 480, in _get_node
raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigKeyError: Missing key aws
full_key: aws
object_type=dict
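For context on the traceback: `_get_cluster_config` indexes the config with the detected cluster name directly, so a detected-but-unconfigured name like `aws` raises a missing-key error. A minimal sketch of the pattern (plain Python dicts standing in for the omegaconf config; the `get_cluster_config` helper, the fallback behavior, and the config contents here are hypothetical, not audiocraft's actual code):

```python
# Hypothetical stand-in for the cluster section of the config; the real
# audiocraft config is an omegaconf DictConfig loaded from YAML.
cluster_configs = {"default": {"dora_dir": "/tmp/audiocraft/outputs"}}

def get_cluster_config(configs, cluster):
    # environment.py effectively does `configs[cluster]`, which fails when
    # the detected cluster (e.g. "aws") has no entry. Falling back to a
    # "default" entry is one way such a lookup could be made tolerant.
    return configs.get(cluster, configs["default"])

# "aws" is detected but not configured, so this falls back to "default"
# instead of raising a missing-key error.
print(get_cluster_config(cluster_configs, "aws"))
```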
Is there a way around this? I thought the `-d` flag was supposed to ignore cluster info, but it seems like AWS is being detected automatically.
Is there a way to run MusicGen training with Dora without setting up a Slurm cluster?