Add support for bundled parameters

n-gao commented 3 years ago

Reference issue

At the moment, parameters in grid enable one to perform a complete grid search over all parameter ranges. However, in practice one may only change two parameters jointly.

What does this implement/fix?

This PR adds functionality that enables one to create parameter bundles in grid. All parameters within a zip must have the same number of possible values and are only changed jointly. Here an example .yaml file:

grid:
  attribute.test:
    type: choice
    options:
      - 1
      - 2
    zip_id: 2

  learning_rate:
    type: uniform
    min: 0.0
    max: 1
    num: 2
    zip_id: 2

  other_attribute:
    type: choice
    options:
      - True
      - False

The resulting configurations are:

[
    {
        'attribute': { 'test': 1},
        'learning_rate': 0.0,
        'other_attribute': True
    },
    {
        'attribute': { 'test': 1},
        'learning_rate': 0.0,
        'other_attribute': False
    },
    {
        'attribute': { 'test': 2},
        'learning_rate':1.0,
        'other_attribute': True
    },
    {
        'attribute': { 'test': 2},
        'learning_rate': 1.0,
        'other_attribute': False
    }
]

Additional information

This PR also includes a test case for bundled parameters.

gasteigerjo commented 3 years ago

I like the idea, but find the term "group" a little ambiguous. There are so many things that a "group" could refer to. Is there a better term? Something like "coupling"?

danielzuegner commented 3 years ago

I like the idea, but find the term "group" a little ambiguous. There are so many things that a "group" could refer to. Is there a better term? Something like "coupling"?

We also had the idea of calling the coupled parameters a bundle. What do you think? I agree that "group" is a bit general...

n-gao commented 3 years ago

Wait, why would UUIDs conflict with each other? They are meant to never conflict, or only in astronomically improbable cases.

You're definitely right, there's a chance of something like 10^-10 to 10^-11 of having two identical ones (even less if you consider that they are generated sequentially from the same series of PRNG). So in all practical uses, it is more or less useless to explicitly avoid identical uuids. But, to avoid confusion for that one user in 10billion I thought I might as well catch that explicitly :)

To cite Wikipedia:

Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. Or, to put it another way, the probability of one duplicate would be about 50% if every person on earth owned 600 million UUIDs.

You never know, maybe seml is getting widespread adoption in a civilization with an orders of magnitude higher population.

gasteigerjo commented 3 years ago

We're using UUIDs in several other places in SEML, though. And relying on their defining uniqueness simplifies the code in all of those places.

gasteigerjo commented 3 years ago

I think I prefer bundle to group. The picture it invokes is better. Also, with bundle the user thinks this is something special and looks it up, instead of just assuming something.

And I think the term "coupling" is quite instructive and would be good to use in the docs.

gasteigerjo commented 3 years ago

Another option would be link. link: 1 (in the yaml) seems to give the right idea.

n-gao commented 3 years ago

I would prefer bundle, link sounds more like 1:1 mapping to me.

gasteigerjo commented 3 years ago

How do you handle cases where one parameter in the bundle has more options than another?

n-gao commented 3 years ago

How do you handle cases where one parameter in the bundle has more options than another?

we throw an exception.

gasteigerjo commented 3 years ago

I somehow think it'd be more intuitive to have the bundled parameters together in one place, instead of separated like this. Something like this:

grid:
  bundle:
    - attribute.test: 1
      learning_rate: 0
    - attribute.test: 2
      learning_rate: 1

  other_attribute:
    type: choice
    options:
      - True
      - False

Of course, then we can't use choice, uniform, etc. anymore. But do we want those in these cases anyway?

n-gao commented 3 years ago

This would definitely introduce a lot more code and arguably larger config files with less flexibility. I think it would be great to simply keep using choice, uniform, etc.

If we would use it as a top-level keyword

grid:
  dict:
    attribute1:
      type: choice
      options: [1, 2, 3, 4, 5, 6]
      bundle: bundle1
    attribute2:
      type: uniform
      min: 0
      max: 1
      num: 6
      bundle: bundle1

would become something like this:

grid:
  bundle1:
    - dict.attribute1: 1
      dict.attribute2: 0
    - dict.attribute1: 2
      dict.attribute2: 0.2
    - dict.attribute1: 3
      dict.attribute2: 0.4
    - dict.attribute1: 4
      dict.attribute2: 0.6
    - dict.attribute1: 5
      dict.attribute2: 0.8
    - dict.attribute1: 6
      dict.attribute2: 1

which, imho, is harder to read and maintain.

danielzuegner commented 3 years ago

A middle ground could be something like this:

grid:
  bundle:
    dict.learning_rate:
      type: uniform
      ...
    dict.other_param:
      type: choice
      ...

gasteigerjo commented 3 years ago

I guess my main thought is: What are the use cases for bundles? What is the most natural way of reflecting those use cases?

So I considered different model variants or datasets. In that case you have some parameters that are set to different values for each variant. And then it makes more sense to put all of those values in one place, instead of scattering them around your grid.

But that use case already seems to be covered well by our regular structure of sub-configs, so maybe you had something else in mind?

n-gao commented 3 years ago

This notation does not permit the splitting of bundled parameters across the "global" configuration and the sub configs, e.g.:

grid:
  dict:
    attribute1:
      type: choice
      options: [1, 2, 3, 4, 5, 6]
      bundle: bundle1

sub_config:
  grid:
    dict:
      attribute2:
        type: uniform
        min: 0
        max: 1
        num: 6
        bundle: bundle1

n-gao commented 3 years ago

In my experiments, I often see myself commenting out parts of the configuration, adding them to the collection, commenting out other parts, adding them to the collection, etc. because sub-configurations are already occupied by datasets.

An example would be if I change the optimizer and learning rate. Since I can't use subconfigs as these are reserved for my datasets because they do not fit into the grid notation. So I have to comment in and out 2 lines to test 2 optimizers with different learning rates.

Here's an actual example:

fixed:
  wfnet.ferminet_params.determinants: 32
  wfnet.ferminet_params.hidden_dims:
      [[512, 32, True], [512, 32, True], [512, 32, True], [512, 32, True]]
  training.max_steps: 200000
  training.patience: 10000

nitrogen:
  fixed:
    system:
      name: diatomic
      training:
        collection_type: dynamic
        n_configs: 16
        config:
          symbol1: N
          symbol2: N
          R:
            lower: 1.4
            upper: 6.5
            std: 0.05

      validation:
        collection_type: static
        n_configs: 16
        total_samples: 1000000
        config:
          symbol1: N
          symbol2: N
          R:
            - 1.60151
            - 1.70828
            - 1.81505
            - 1.92181
            - 2.02858
            - 2.13535
            - 2.24212
            - 2.34889
            - 2.45565
            - 2.56242
            - 2.66919
            - 2.77595
            - 2.88272
            - 2.98949
            - 3.09626
            - 3.20302
            - 3.30979
            - 3.41656
            - 3.52333
            - 3.63009
            - 3.73686
            - 3.84363
            - 3.95040
            - 4.05716
            - 4.16393
            - 4.27070
            - 4.37747
            - 4.48423
            - 4.59100
            - 4.69777
            - 4.80454
            - 4.91130
            - 5.01807
            - 5.12484
            - 5.23161
            - 5.33837
            - 5.87221
            - 6.40605

cyclobutadiene:
  fixed:
    system:
      name: cyclobutadiene
      training:
        collection_type: static
        config:
          state:
            - ground
            - transition

      validation:
        collection_type: static
        total_samples: 1000000
        config:
          state:
            - ground
            - transition

If I want to change training.max_steps and training.patience jointly there is no way around commenting them and in out.

With bundled parameters I could do:

...
grid:
  training.max_steps:
    - type: choice
    - options: [100000, 200000]
    - bundle: convergence
  training.patience: 
    - type: choice
    - options: [5000, 10000]
    - bundle: convergence
...

Currently, I could accomplish the same only by copying both sub-collection blocks with different training parameters.

gasteigerjo commented 3 years ago

So in this case it seems like you would really want the ability of combining sub-configs.

gasteigerjo commented 3 years ago

One way of giving more control over how to handle subconfigs would be something like this:

# This now just defines subconfigs, but we don't do anything with them.
nitrogen:
  ...
cyclobutadiene:
  ...
short_training:
  ...
long_training:
  ...

# Now the user themself has to define how to use those subconfigs
grid:
  config:
    - nitrogen
    - cyclobutadiene
  config:
    - short_training
    - long_training

  other_attribute:
    type: choice
    options:
      - True
      - False

This still precludes the ability of using ranges etc, but I'm not really convinced that a range is what you want here in our use cases. Also, coupling ranges seems rather error-prone.

BUT, you can now define a grid in each subconfig, which would give you the power to do other things.

One problem with this would be that there isn't a clear hierarchy of configs anymore. So if you have duplicate parameters across subconfigs there's no clear way of resolving it.

n-gao commented 3 years ago

That seems to be quite hard to get right because as you've said then there are suddenly subconfigs which only permit fixed if you wanna use them in grid.

I don't see a large downside to the implementation in this PR, its upsides are:

it is completely optional and all current configs are still valid
it's just a few lines of code (both in seml as well in the actual config files)
it still permits all the nice choice, uniform and loguniform iteration features
it is very easy to check for user errors and report clear advice on what went wrong

My main objection with introducing a new keyword or using grid over configs is that it is much more error-prone and would require significantly more refactoring of the codebase. Catching all the ways one could mishandle these systems seems to be much harder than simply checking whether all parameters in a bundle have the same length.

gasteigerjo commented 3 years ago

The current way of implementing it is easier, absolutely. But it seems like a band-aid for a problem that would need a real, clean solution. The real solution certainly takes more effort. But then people don't have to use awkward definitions just because we were stuck in the current frame of thinking when implementing it. I really think we should take a step back and try to find a cleaner way for this.

gasteigerjo commented 3 years ago

I think we can improve the solution I've suggested above by keeping the subconfigs we already have, and adding the ability for the user to define parameter bundles, which they can later use in their grids etc. Bundles would work like I've suggested above and only contain fixed parameters.

Something like this:

# Define parameter groups/bundles, to use them later
def:
  nitrogen:
    ...
  cyclobutadiene:
    ...
  short_training:
    ...
  long_training:
    ...
  model1:
    ...
  model2:
    ...

# The user has to specify how to use the parameter groups/bundles 
fixed:
  model1: load

grid:
  dataset:  # The name doesn't do anything, but it has to be unique (due to yaml)
    type: load
    options:
      - nitrogen
      - cyclobutadiene
  training:
    type: load
    options:
      - short_training
      - long_training

  other_attribute:
    type: choice
    options:
      - True
      - False

We'd raise an error if bundles contain conflicting parameters (within and across). Implementation-wise this means that we simply have to exchange nitrogen for whatever the user defined, before doing all the checks.

Hmm, but it might actually be nice to define full subconfigs (with fixed/grid/random), and then combine them at will later on. Currently, we assume that each subconfig is independent. So instead, we would now say that the user defines how the subconfigs are used or combined. Searching for conflicts might not be that hard, actually.

n-gao commented 2 years ago

@klicperajo @danielzuegner what do we do with this PR? Should we merge it or simply keep it open?

Regarding naming, it's called zip now.

danielzuegner commented 2 years ago

@klicperajo @danielzuegner what do we do with this PR? Should we merge it or simply keep it open?

Regarding naming, it's called zip now.

I kind of liked the idea of a more "full-fledged" overhaul in order to support groups/bundles as in the later comment, but I'm not sure if anyone has time to implement it soon-ish. If not, I'm fine with going with zip for now as it definitely adds useful features to seml. Let me know what you think.

gasteigerjo commented 2 years ago

I'll have time to look into the larger overhaul in March. Would you have time to collaborate on this, @n-gao?

I'd prefer not to merge this until then.

n-gao commented 2 years ago

Sure let's postpone the merge until we have a clearer roadmap on the larger overhaul. I am happy to collaborate on this.

TUM-DAML / seml