slerman12 opened 3 years ago
Hi, this is related to the interpolations you have in your YAML file:
...
dir: ./exp/${now:%Y.%m.%d}/${now:%H%M}_${agent_cfg.experiment}
...
submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent_cfg.experiment}/.slurm
...
The "${agent_cfg.experiment}"
syntax is used to reference another node in your config tree... but I don't see an agent_cfg
key anywhere in your yaml file.
You can learn more about how interpolations work from the OmegaConf docs.
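As a minimal illustration (not from your repo), an interpolation like ${experiment} simply resolves to the value of a node named experiment, so the referenced node has to exist somewhere in the composed config:

experiment: exp
dir: ./exp/${experiment}  # resolves to ./exp/exp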
Take a look at your generated config to check whether you actually have an `agent_cfg.experiment` node. Something like:
$ python run.py --cfg job --config-name atari-slurm
Here is the output. I'm not seeing `agent_cfg.experiment`, but that config works with a `submitit_local` request.
envs: atari
frame_stack: 4
action_repeat: 4
discount: 0.99
num_train_frames: 100001
num_seed_frames: 1600
max_episode_frames: 27000
truncate_episode_frames: 400
eval_every_frames: 20000
num_eval_episodes: 10
save_snapshot: false
replay_buffer_size: ${num_train_frames}
replay_buffer_num_workers: 4
prioritized_replay: false
prioritized_replay_alpha: 0.6
nstep: 10
batch_size: 32
seed: 1
device: cuda
save_video: true
save_train_video: false
use_tb: true
experiment: exp
lr: 0.0001
adam_eps: 0.00015
max_grad_norm: 10.0
feature_dim: 50
agent:
  _target_: agents.Agent
  obs_shape: ???
  action_shape: ???
  discrete: ???
  device: ${device}
  lr: ${lr}
  adam_eps: ${adam_eps}
  max_grad_norm: ${max_grad_norm}
  critic_target_tau: 0.01
  min_eps: 0.1
  num_seed_frames: ${num_seed_frames}
  intensity_scale: 0.05
  double_q: true
  dueling: true
  use_tb: ${use_tb}
  num_expl_steps: 5000
  hidden_dim: 512
  feature_dim: ${feature_dim}
  prioritized_replay_beta0: 0.4
  prioritized_replay_beta_steps: ${num_train_frames}
stddev_schedule: linear(1.0,0.1,500000)
task_name: Breakout
I can't comment on why you think it works in some other mode. It will not work if you are missing that config node.
Let me show you two configs. Here, running `python run.py --config-name atari` works:
defaults:
- _self_
- task@_global_: atari/pong
- override hydra/launcher: submitit_local
# environments/domain
envs: atari
# task settings
frame_stack: 4
action_repeat: 4
## see section 4.1 in https://arxiv.org/pdf/1812.06110.pdf
#terminal_on_life_loss: true # true by default
discount: 0.99
# train settings
num_train_frames: 1000001
num_seed_frames: 1600 # should be >= replay_buffer_num_workers * truncate_episode_len
#num_seed_frames: 4004 # should be >= replay_buffer_num_workers * truncate_episode_len + action_repeat ?
#num_seed_frames: 12000
#num_exploration_steps: 5000
max_episode_frames: 27000 # must be > update_every_steps, >= nstep - 1
truncate_episode_frames: 400
#truncate_episode_len: false
# eval
#eval_every_frames: 100000
#num_eval_episodes: 10 # would this take too long in atari?
eval_every_frames: 20000
num_eval_episodes: 10
# snapshot
save_snapshot: false
# replay buffer
replay_buffer_size: ${num_train_frames}
#store_every_frames: 1000 # should be below seed frames I think
#store_every_frames: false
#replay_buffer_num_workers: 2
replay_buffer_num_workers: 4
prioritized_replay: false
prioritized_replay_alpha: 0.6
nstep: 10
#batch_size: 256
batch_size: 32
# misc
seed: 1
#device: cpu
device: cuda
save_video: true
save_train_video: false
use_tb: true
# experiment
experiment: exp
# agent
lr: 1e-4
adam_eps: 0.00015
max_grad_norm: 10.0
feature_dim: 50
agent:
  _target_: agents.Agent
  obs_shape: ??? # to be specified later
  action_shape: ??? # to be specified later
  discrete: ??? # to be specified later
  device: ${device}
  lr: ${lr}
  adam_eps: ${adam_eps}
  max_grad_norm: ${max_grad_norm}
  critic_target_tau: 0.01
  min_eps: 0.1
  num_seed_frames: ${num_seed_frames}
  # critic_target_update_frequency: 1
  # critic_target_tau: 1.0
  intensity_scale: 0.05
  double_q: true
  dueling: true
  # update_every_steps: 2
  use_tb: ${use_tb}
  # num_expl_steps: 2000
  num_expl_steps: 5000
  # num_expl_steps: 20000
  # hidden_dim: 1024
  hidden_dim: 512
  feature_dim: ${feature_dim}
  prioritized_replay_beta0: 0.4
  prioritized_replay_beta_steps: ${num_train_frames}
hydra:
  run:
    dir: ./exp_local/${now:%Y.%m.%d}/${now:%H%M%S}_${hydra.job.override_dirname}
  sweep:
    dir: ./exp/${now:%Y.%m.%d}/${now:%H%M}_${agent_cfg.experiment}
    subdir: ${hydra.job.num}
  launcher:
    timeout_min: 4300
    # cpus_per_task: 10
    cpus_per_task: 1
    # gpus_per_node: 1
    gpus_per_node: 0
    tasks_per_node: 1
    mem_gb: 160
    nodes: 1
    submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent_cfg.experiment}/.slurm
Here is the one I was referring to above; running `python run.py --multirun --config-name atari-slurm seed=1,2,3,4,5` results in that error:
defaults:
- _self_
- task@_global_: atari/breakout
- override hydra/launcher: submitit_slurm
# environments/domain
envs: atari
# task settings
frame_stack: 4
action_repeat: 4
## see section 4.1 in https://arxiv.org/pdf/1812.06110.pdf
#terminal_on_life_loss: true # true by default
discount: 0.99
# train settings
num_train_frames: 1000001
num_seed_frames: 1600 # should be >= replay_buffer_num_workers * truncate_episode_len
#num_seed_frames: 4004 # should be >= replay_buffer_num_workers * truncate_episode_len + action_repeat ?
#num_seed_frames: 12000
#num_exploration_steps: 5000
max_episode_frames: 27000 # must be > update_every_steps, >= nstep - 1
truncate_episode_frames: 400
#truncate_episode_len: false
# eval
#eval_every_frames: 100000
#num_eval_episodes: 10 # would this take too long in atari?
eval_every_frames: 20000
num_eval_episodes: 10
# snapshot
save_snapshot: false
# replay buffer
replay_buffer_size: ${num_train_frames}
#store_every_frames: 1000 # should be below seed frames I think
#store_every_frames: false
#replay_buffer_num_workers: 2
replay_buffer_num_workers: 4
prioritized_replay: false
prioritized_replay_alpha: 0.6
nstep: 10
#batch_size: 256
batch_size: 32
# misc
seed: 1
#device: cpu
device: cuda
save_video: true
save_train_video: false
use_tb: true
# experiment
experiment: exp
# agent
lr: 1e-4
adam_eps: 0.00015
max_grad_norm: 10.0
feature_dim: 50
agent:
  _target_: agents.Agent
  obs_shape: ??? # to be specified later
  action_shape: ??? # to be specified later
  discrete: ??? # to be specified later
  device: ${device}
  lr: ${lr}
  adam_eps: ${adam_eps}
  max_grad_norm: ${max_grad_norm}
  critic_target_tau: 0.01
  min_eps: 0.1
  num_seed_frames: ${num_seed_frames}
  # critic_target_update_frequency: 1
  # critic_target_tau: 1.0
  intensity_scale: 0.05
  double_q: true
  dueling: true
  # update_every_steps: 2
  use_tb: ${use_tb}
  # num_expl_steps: 2000
  num_expl_steps: 5000
  # num_expl_steps: 20000
  # hidden_dim: 1024
  hidden_dim: 512
  feature_dim: ${feature_dim}
  prioritized_replay_beta0: 0.4
  prioritized_replay_beta_steps: ${num_train_frames}
hydra:
  run:
    dir: ./exp_local/${now:%Y.%m.%d}/${now:%H%M%S}_${hydra.job.override_dirname}
  sweep:
    dir: ./exp/${now:%Y.%m.%d}/${now:%H%M}_${agent_cfg.experiment}
    subdir: ${hydra.job.num}
  launcher:
    timeout_min: 4300
    cpus_per_task: 4
    gpus_per_node: 4
    tasks_per_node: 4
    mem_gb: 160
    nodes: 2
    partition: gpu
    # gres: gpu:4
    cpus_per_gpu: 16
    gpus_per_task: 1
    constraint: K80
    # mem_per_gpu: null
    # mem_per_cpu: null
    submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent_cfg.experiment}/.slurm
`sweep.dir` is only accessed at multirun; that is why RUN works for you but MULTIRUN does not.
As omry mentioned, you do not have an `agent_cfg.experiment` node in your config. Did you mean the `experiment` node? (That node does exist.)
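For example, assuming the intent was to reference that top-level experiment node, the two interpolations could be rewritten along these lines (a sketch against your config above):

hydra:
  sweep:
    dir: ./exp/${now:%Y.%m.%d}/${now:%H%M}_${experiment}
  launcher:
    submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${experiment}/.slurm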
I'm confused, why is the parser looking for "agent_cfg.experiment" when I do multi-run?
Oh oh, I see. Sorry, I did not realize sweep.dir included that. My bad, okay, let me try without that.
Okay, it made progress. I'm getting this error now. Do you know what it means?
raise FailedJobError(stderr) from subprocess_error
submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid wckey specification
This is a Slurm error; this issue might help: https://github.com/facebookincubator/submitit/issues/1632
Thanks. They recommend passing in `wckey=""` as an argument. Should I just include `wckey: none` in the hydra.launcher config?
Hi, sorry, neither repo has gotten back to me about how to disable `wckey`.
Hey @slerman12, sorry for getting back to you late!
`wckey` is not a supported Hydra submitit launcher config option. Would you be able to try adding it to the launcher config and see if it helps in your environment?
Adding a config option to the launcher is easy; there is an example here: https://github.com/facebookresearch/hydra/commit/f17eef4391ba18e4fcbb835fc175cbf180c28b7f
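Once a wckey field exists in the launcher schema (following the pattern in that commit), it could then be set from the primary config. A sketch, assuming the field is indeed named wckey:

hydra:
  launcher:
    wckey: ''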
Hmm, that didn't seem to fix it. When I run `python run.py --multirun --config-name atari-slurm seed=1,2`, I still get `sbatch: error: Batch job submission failed: Invalid wckey specification`.
Here is what I did:
vim /python3.8/site-packages/hydra_plugins/hydra_submitit_launcher/config.py
And on line 28, I added:
wckey: str = ""
Could you share what launcher config you have for your application?
python run.py --cfg hydra -p hydra.launcher
Yes, I ran `python run.py --config-name atari-slurm --cfg hydra -p hydra.launcher` and got:
Key 'gres' not in 'SlurmQueueConf'
full_key: hydra.launcher.gres
object_type=SlurmQueueConf
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I'm not sure, I might've changed something since we last talked.
`gres` support was added only recently; if you are modifying the launcher code locally (to test out adding `wckey`), you need to sync your code to the latest and install from source.
Looks like I'm running the latest release...
$ pip show hydra-core
Name: hydra-core
Version: 1.1.0
Am I missing something?
hydra-submitit-launcher is a separate package :)
Just ran: pip install hydra-submitit-launcher --upgrade
Then: python run.py --config-name atari-slurm --cfg hydra -p hydra.launcher
# @package hydra.launcher
submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent._target_}_${experiment}/.slurm
timeout_min: 4300
cpus_per_task: 4
gpus_per_node: null
tasks_per_node: 4
mem_gb: 20
nodes: 2
name: ${hydra.job.name}
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
partition: gpu
qos: null
comment: null
constraint: K80
exclude: null
gres: gpu:1
cpus_per_gpu: null
gpus_per_task: null
mem_per_gpu: null
mem_per_cpu: null
signal_delay_s: 120
max_num_timeout: 0
additional_parameters: {}
array_parallelism: 256
setup: null
Then I re-applied the edit:
vim /python3.8/site-packages/hydra_plugins/hydra_submitit_launcher/config.py
adding wckey: str = "" on line 28 again, and ran:
python run.py --config-name atari-slurm --cfg hydra -p hydra.launcher
This is the output:
# @package hydra.launcher
submitit_folder: ./exp/${now:%Y.%m.%d}/${now:%H%M%S}_${agent._target_}_${experiment}/.slurm
timeout_min: 4300
cpus_per_task: 4
gpus_per_node: null
tasks_per_node: 4
mem_gb: 20
nodes: 2
name: ${hydra.job.name}
wckey: ''
_target_: hydra_plugins.hydra_submitit_launcher.submitit_launcher.SlurmLauncher
partition: gpu
qos: null
comment: null
constraint: K80
exclude: null
gres: gpu:1
cpus_per_gpu: null
gpus_per_task: null
mem_per_gpu: null
mem_per_cpu: null
signal_delay_s: 120
max_num_timeout: 0
additional_parameters: {}
array_parallelism: 256
setup: null
Can you try passing additional_parameters: {"wc_key": ""}?
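For reference, a sketch of that suggestion as it might look in the primary config. The exact key spelling is an assumption here: additional_parameters entries are forwarded to sbatch, whose option is --wckey, so wckey (rather than wc_key) may be what Slurm expects:

hydra:
  launcher:
    additional_parameters:
      wckey: ''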
I'm trying to sweep a set of hyperparameters using the Slurm submitit plugin.
I run:
python run.py --multirun --config-name atari-slurm seed=1,2,3,4,5
And my config file looks something like this:
However, I get this error:
*I'm running this on a Slurm cluster where we ordinarily use sbatch to submit jobs, such as a multi-GPU job like the one defined in the config above.