slerman12 opened 3 years ago
Hi, this is related to the interpolations you have in your YAML file:
...
dir: ./exp/${now:%Y.%m.%d}/${now:%H%M}_${agent_cfg.experiment}
...
submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent_cfg.experiment}/.slurm
...
The "${agent_cfg.experiment}"
syntax is used to reference another node in your config tree... but I don't see an agent_cfg
key anywhere in your yaml file.
You can learn more about how interpolations work from the OmegaConf docs.
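As a minimal illustration (not from your repo), an interpolation like ${experiment} simply resolves to the value of a node named experiment, so the referenced node has to exist somewhere in the composed config:

experiment: exp
dir: ./exp/${experiment}  # resolves to ./exp/exp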
Take a look at your generated config to check whether you actually have an `agent_cfg.experiment` node. Something like:
$ python run.py --cfg job --config-name atari-slurm
Here is the output. I'm not seeing `agent_cfg.experiment`, but that config works with a `submitit_local` request.
envs: atari
frame_stack: 4
action_repeat: 4
discount: 0.99
num_train_frames: 100001
num_seed_frames: 1600
max_episode_frames: 27000
truncate_episode_frames: 400
eval_every_frames: 20000
num_eval_episodes: 10
save_snapshot: false
replay_buffer_size: ${num_train_frames}
replay_buffer_num_workers: 4
prioritized_replay: false
prioritized_replay_alpha: 0.6
nstep: 10
batch_size: 32
seed: 1
device: cuda
save_video: true
save_train_video: false
use_tb: true
experiment: exp
lr: 0.0001
adam_eps: 0.00015
max_grad_norm: 10.0
feature_dim: 50
agent:
  _target_: agents.Agent
  obs_shape: ???
  action_shape: ???
  discrete: ???
  device: ${device}
  lr: ${lr}
  adam_eps: ${adam_eps}
  max_grad_norm: ${max_grad_norm}
  critic_target_tau: 0.01
  min_eps: 0.1
  num_seed_frames: ${num_seed_frames}
  intensity_scale: 0.05
  double_q: true
  dueling: true
  use_tb: ${use_tb}
  num_expl_steps: 5000
  hidden_dim: 512
  feature_dim: ${feature_dim}
  prioritized_replay_beta0: 0.4
  prioritized_replay_beta_steps: ${num_train_frames}
stddev_schedule: linear(1.0,0.1,500000)
task_name: Breakout
I can't comment on why you think it works in some other mode. It will not work if you are missing that config node.
Let me show you two configs. Here, running `python run.py --config-name atari` works:
defaults:
- _self_
- task@_global_: atari/pong
- override hydra/launcher: submitit_local
# environments/domain
envs: atari
# task settings
frame_stack: 4
action_repeat: 4
## see section 4.1 in https://arxiv.org/pdf/1812.06110.pdf
#terminal_on_life_loss: true # true by default
discount: 0.99
# train settings
num_train_frames: 1000001
num_seed_frames: 1600 # should be >= replay_buffer_num_workers * truncate_episode_len
#num_seed_frames: 4004 # should be >= replay_buffer_num_workers * truncate_episode_len + action_repeat ?
#num_seed_frames: 12000
#num_exploration_steps: 5000
max_episode_frames: 27000 # must be > update_every_steps, >= nstep - 1
truncate_episode_frames: 400
#truncate_episode_len: false
# eval
#eval_every_frames: 100000
#num_eval_episodes: 10 # would this take too long in atari?
eval_every_frames: 20000
num_eval_episodes: 10
# snapshot
save_snapshot: false
# replay buffer
replay_buffer_size: ${num_train_frames}
#store_every_frames: 1000 # should be below seed frames I think
#store_every_frames: false
#replay_buffer_num_workers: 2
replay_buffer_num_workers: 4
prioritized_replay: false
prioritized_replay_alpha: 0.6
nstep: 10
#batch_size: 256
batch_size: 32
# misc
seed: 1
#device: cpu
device: cuda
save_video: true
save_train_video: false
use_tb: true
# experiment
experiment: exp
# agent
lr: 1e-4
adam_eps: 0.00015
max_grad_norm: 10.0
feature_dim: 50
agent:
  _target_: agents.Agent
  obs_shape: ??? # to be specified later
  action_shape: ??? # to be specified later
  discrete: ??? # to be specified later
  device: ${device}
  lr: ${lr}
  adam_eps: ${adam_eps}
  max_grad_norm: ${max_grad_norm}
  critic_target_tau: 0.01
  min_eps: 0.1
  num_seed_frames: ${num_seed_frames}
  # critic_target_update_frequency: 1
  # critic_target_tau: 1.0
  intensity_scale: 0.05
  double_q: true
  dueling: true
  # update_every_steps: 2
  use_tb: ${use_tb}
  # num_expl_steps: 2000
  num_expl_steps: 5000
  # num_expl_steps: 20000
  # hidden_dim: 1024
  hidden_dim: 512
  feature_dim: ${feature_dim}
  prioritized_replay_beta0: 0.4
  prioritized_replay_beta_steps: ${num_train_frames}
hydra:
  run:
    dir: ./exp_local/${now:%Y.%m.%d}/${now:%H%M%S}_${hydra.job.override_dirname}
  sweep:
    dir: ./exp/${now:%Y.%m.%d}/${now:%H%M}_${agent_cfg.experiment}
    subdir: ${hydra.job.num}
  launcher:
    timeout_min: 4300
    # cpus_per_task: 10
    cpus_per_task: 1
    # gpus_per_node: 1
    gpus_per_node: 0
    tasks_per_node: 1
    mem_gb: 160
    nodes: 1
    submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent_cfg.experiment}/.slurm
Here is the one I was referring to above; running `python run.py --multirun --config-name atari-slurm seed=1,2,3,4,5` results in that error:
defaults:
- _self_
- task@_global_: atari/breakout
- override hydra/launcher: submitit_slurm
# environments/domain
envs: atari
# task settings
frame_stack: 4
action_repeat: 4
## see section 4.1 in https://arxiv.org/pdf/1812.06110.pdf
#terminal_on_life_loss: true # true by default
discount: 0.99
# train settings
num_train_frames: 1000001
num_seed_frames: 1600 # should be >= replay_buffer_num_workers * truncate_episode_len
#num_seed_frames: 4004 # should be >= replay_buffer_num_workers * truncate_episode_len + action_repeat ?
#num_seed_frames: 12000
#num_exploration_steps: 5000
max_episode_frames: 27000 # must be > update_every_steps, >= nstep - 1
truncate_episode_frames: 400
#truncate_episode_len: false
# eval
#eval_every_frames: 100000
#num_eval_episodes: 10 # would this take too long in atari?
eval_every_frames: 20000
num_eval_episodes: 10
# snapshot
save_snapshot: false
# replay buffer
replay_buffer_size: ${num_train_frames}
#store_every_frames: 1000 # should be below seed frames I think
#store_every_frames: false
#replay_buffer_num_workers: 2
replay_buffer_num_workers: 4
prioritized_replay: false
prioritized_replay_alpha: 0.6
nstep: 10
#batch_size: 256
batch_size: 32
# misc
seed: 1
#device: cpu
device: cuda
save_video: true
save_train_video: false
use_tb: true
# experiment
experiment: exp
# agent
lr: 1e-4
adam_eps: 0.00015
max_grad_norm: 10.0
feature_dim: 50
agent:
  _target_: agents.Agent
  obs_shape: ??? # to be specified later
  action_shape: ??? # to be specified later
  discrete: ??? # to be specified later
  device: ${device}
  lr: ${lr}
  adam_eps: ${adam_eps}
  max_grad_norm: ${max_grad_norm}
  critic_target_tau: 0.01
  min_eps: 0.1
  num_seed_frames: ${num_seed_frames}
  # critic_target_update_frequency: 1
  # critic_target_tau: 1.0
  intensity_scale: 0.05
  double_q: true
  dueling: true
  # update_every_steps: 2
  use_tb: ${use_tb}
  # num_expl_steps: 2000
  num_expl_steps: 5000
  # num_expl_steps: 20000
  # hidden_dim: 1024
  hidden_dim: 512
  feature_dim: ${feature_dim}
  prioritized_replay_beta0: 0.4
  prioritized_replay_beta_steps: ${num_train_frames}
hydra:
  run:
    dir: ./exp_local/${now:%Y.%m.%d}/${now:%H%M%S}_${hydra.job.override_dirname}
  sweep:
    dir: ./exp/${now:%Y.%m.%d}/${now:%H%M}_${agent_cfg.experiment}
    subdir: ${hydra.job.num}
  launcher:
    timeout_min: 4300
    cpus_per_task: 4
    gpus_per_node: 4
    tasks_per_node: 4
    mem_gb: 160
    nodes: 2
    partition: gpu
    # gres: gpu:4
    cpus_per_gpu: 16
    gpus_per_task: 1
    constraint: K80
    # mem_per_gpu: null
    # mem_per_cpu: null
    submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent_cfg.experiment}/.slurm
`sweep.dir` is only accessed at multirun; that is why RUN works for you but MULTIRUN does not.
As omry mentioned, you do not have an `agent_cfg.experiment` node in your config. Did you mean the `experiment` node? (That node does exist.)
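For example, assuming the intent was to reference that top-level experiment node, the two interpolations could be rewritten along these lines (a sketch against your config above):

hydra:
  sweep:
    dir: ./exp/${now:%Y.%m.%d}/${now:%H%M}_${experiment}
  launcher:
    submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${experiment}/.slurm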
I'm confused, why is the parser looking for "agent_cfg.experiment" when I do multi-run?
Oh oh, I see. Sorry, I did not realize sweep.dir included that. My bad, okay, let me try without that.
Okay, it made progress. I'm getting this error now. Do you know what it means?
raise FailedJobError(stderr) from subprocess_error
submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid wckey specification
This is a Slurm error; this issue might help: https://github.com/facebookincubator/submitit/issues/1632
Thanks. They recommend passing in `wckey=""` as an argument. Should I just include `wckey: none` in the hydra.launcher config?
Hi, sorry, neither repo has gotten back to me about how to disable `wckey`.
Hey @slerman12, sorry for getting back to you late!
`wckey` is not a supported Hydra submitit launcher config option. Would you be able to try adding it to the launcher config and see if it helps in your environment?
Adding a config option to the launcher is easy; there is an example here: https://github.com/facebookresearch/hydra/commit/f17eef4391ba18e4fcbb835fc175cbf180c28b7f
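Once a wckey field exists in the launcher schema (following the pattern in that commit), it could then be set from the primary config. A sketch, assuming the field is indeed named wckey:

hydra:
  launcher:
    wckey: ''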
Hmm, that didn't seem to fix it. When I run `python run.py --multirun --config-name atari-slurm seed=1,2`, I still get `sbatch: error: Batch job submission failed: Invalid wckey specification`.
Here is what I did:
vim /python3.8/site-packages/hydra_plugins/hydra_submitit_launcher/config.py
And on line 28, I added:
wckey: str = ""
Could you share what launcher config you have for your application?
python run.py --cfg hydra -p hydra.launcher
Yes, I ran `python run.py --config-name atari-slurm --cfg hydra -p hydra.launcher` and got:
Key 'gres' not in 'SlurmQueueConf'
full_key: hydra.launcher.gres
object_type=SlurmQueueConf
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I'm not sure, I might've changed something since we last talked.
`gres` support was added only recently; if you are modifying the launcher code locally (to test out adding `wckey`), you need to sync your code to the latest and install from source.
Looks like I'm running the latest release...
$ pip show hydra-core
Name: hydra-core
Version: 1.1.0
Am I missing something?
hydra-submitit-launcher is a separate package :)
Just ran: pip install hydra-submitit-launcher --upgrade
Then: python run.py --config-name atari-slurm --cfg hydra -p hydra.launcher
# @package hydra.launcher
submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent._target_}_${experiment}/.slurm
timeout_min: 4300
cpus_per_task: 4
gpus_per_node: null
tasks_per_node: 4
mem_gb: 20
nodes: 2
name: ${hydra.job.name}
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
partition: gpu
qos: null
comment: null
constraint: K80
exclude: null
gres: gpu:1
cpus_per_gpu: null
gpus_per_task: null
mem_per_gpu: null
mem_per_cpu: null
signal_delay_s: 120
max_num_timeout: 0
additional_parameters: {}
array_parallelism: 256
setup: null
Then I re-applied the edit:
vim /python3.8/site-packages/hydra_plugins/hydra_submitit_launcher/config.py
adding wckey: str = "" on line 28 again, and ran:
python run.py --config-name atari-slurm --cfg hydra -p hydra.launcher
This is the output:
# @package hydra.launcher
submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent._target_}_${experiment}/.slurm
timeout_min: 4300
cpus_per_task: 4
gpus_per_node: null
tasks_per_node: 4
mem_gb: 20
nodes: 2
name: ${hydra.job.name}
wckey: ''
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
partition: gpu
qos: null
comment: null
constraint: K80
exclude: null
gres: gpu:1
cpus_per_gpu: null
gpus_per_task: null
mem_per_gpu: null
mem_per_cpu: null
signal_delay_s: 120
max_num_timeout: 0
additional_parameters: {}
array_parallelism: 256
setup: null
Can you try passing additional_parameters: {"wc_key": ""}?
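For reference, a sketch of that suggestion as it might look in the primary config. The exact key spelling is an assumption here: additional_parameters entries are forwarded to sbatch, whose option is --wckey, so wckey (rather than wc_key) may be what Slurm expects:

hydra:
  launcher:
    additional_parameters:
      wckey: ''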
I'm trying to sweep a set of hyperparameters using the Slurm submitit plugin.
I run:
python run.py --multirun --config-name atari-slurm seed=1,2,3,4,5
And my config file looks something like this:
However, I get this error:
*I'm running this on a Slurm cluster where we ordinarily use sbatch to submit jobs, such as a multi-GPU job like the one defined in the config above.