facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License
8.6k stars 618 forks source link

Error out on attempt to sweep on the Hydra configuration #2381

Open Jasha10 opened 1 year ago

Jasha10 commented 1 year ago

Undefined behavior results from attempting a multirun sweep where the sweep parameters would cause changes to the Hydra config. The undefined behavior results in issues such as https://github.com/facebookresearch/hydra/discussions/2329.

If possible, we should try to detect when sweep parameters would modify the Hydra config and error out under such circumstances.

Based on comment posted by @omry in https://github.com/facebookresearch/hydra/discussions/2329#discussioncomment-3638025

Queuecumber commented 1 year ago

I think the better thing to do would be to try to support this in some way, it may need to be limited

Consider the following experiment that I'm trying to run where I encounter this bug.

# @package _global_
hydra:
  mode: MULTIRUN
  sweeper:
    params:
      extra_params: "{batch_size:1,num_nodes:1},{batch_size:3,num_nodes:2}"
trainer:
  logger:
    group: Batch Size
  num_nodes: ${extra_params.num_nodes}

dataset:
  batch_size: ${extra_params.batch_size}  

name: batch-size/batch-of-${dataset.batch_size}-nodes-${trainer.num_nodes}

All I'm trying to do here is compare using a small batch (1 batch per GPU on 1 machine) to a large batch (3 batches per GPU on 2 machines). This seems like a simple thing to try.

The submitit launcher is set to launch with a number of nodes equal to trainer.num_nodes which in turn is tied to the sweep parameter extra_params.num_nodes.

It seem perfectly reasonable to me that hydra should be able to launch job A with 1 node and job B with 2 nodes, and yet this causes hydra to crash. I get why it's crashing it just seems like the kind of thing that should work.

Thoughts?

Jasha10 commented 1 year ago

Hi @Queuecumber,

I believe what you've described is actually different from this issue. This issue is about sweeps that modify the /hydra config group specifically. The example you gave above sweeps over extra_params, not over hydra.

The example you've given above should work if you make one slight modification: replace extra_params: "{batch_size:1,num_nodes:1},{batch_size:3,num_nodes:2}" with "+extra_params": "{batch_size:1,num_nodes:1},{batch_size:3,num_nodes:2}". I was able to get your example to run after making that change.

Queuecumber commented 1 year ago

This wasn't a minimal example, as I said extra_params.num_nodes is used to configure the number nodes used by the submitit launcher.

Removing all the interpolations that are happening would be closer to this:

# @package _global_
defaults:
 - override /hydra/launcher: submitit_slurm

hydra:
  mode: MULTIRUN
  sweeper:
    params:
      hydra.launcher.nodes: 1,2
Jasha10 commented 1 year ago

I see.

I think the better thing to do would be to try to support this in some way, it may need to be limited

I agree that sweeping over the hydra config would be nice to have. That being said, erroring explicitly would certainly be better than the current behavior, which is undefined, and an error check will probably be much easier to implement for the short term.