Closed n-gao closed 2 years ago
I like the idea, but find the term "group" a little ambiguous. There are so many things that a "group" could refer to. Is there a better term? Something like "coupling"?
I like the idea, but find the term "group" a little ambiguous. There are so many things that a "group" could refer to. Is there a better term? Something like "coupling"?
We also had the idea of calling the coupled parameters a bundle
. What do you think? I agree that "group" is a bit general...
Wait, why would UUIDs conflict with each other? They are meant to never conflict, or only in astronomically improbable cases.
You're definitely right, there's a chance of something like 10^-10 to 10^-11 of having two identical ones (even less if you consider that they are generated sequentially from the same series of PRNG). So in all practical uses, it is more or less useless to explicitly avoid identical uuids. But, to avoid confusion for that one user in 10billion I thought I might as well catch that explicitly :)
To cite Wikipedia:
Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. Or, to put it another way, the probability of one duplicate would be about 50% if every person on earth owned 600 million UUIDs.
You never know, maybe seml
is getting widespread adoption in a civilization with an orders of magnitude higher population.
We're using UUIDs in several other places in SEML, though. And relying on their defining uniqueness simplifies the code in all of those places.
I think I prefer bundle
to group
. The picture it invokes is better. Also, with bundle
the user thinks this is something special and looks it up, instead of just assuming something.
And I think the term "coupling" is quite instructive and would be good to use in the docs.
Another option would be link
. link: 1
(in the yaml) seems to give the right idea.
I would prefer bundle
, link sounds more like 1:1 mapping to me.
How do you handle cases where one parameter in the bundle has more options than another?
How do you handle cases where one parameter in the bundle has more options than another?
we throw an exception.
I somehow think it'd be more intuitive to have the bundled parameters together in one place, instead of separated like this. Something like this:
grid:
bundle:
- attribute.test: 1
learning_rate: 0
- attribute.test: 2
learning_rate: 1
other_attribute:
type: choice
options:
- True
- False
Of course, then we can't use choice, uniform, etc. anymore. But do we want those in these cases anyway?
This would definitely introduce a lot more code and arguably larger config files with less flexibility. I think it would be great to simply keep using choice
, uniform
, etc.
If we would use it as a top-level keyword
grid:
dict:
attribute1:
type: choice
options: [1, 2, 3, 4, 5, 6]
bundle: bundle1
attribute2:
type: uniform
min: 0
max: 1
num: 6
bundle: bundle1
would become something like this:
grid:
bundle1:
- dict.attribute1: 1
dict.attribute2: 0
- dict.attribute1: 2
dict.attribute2: 0.2
- dict.attribute1: 3
dict.attribute2: 0.4
- dict.attribute1: 4
dict.attribute2: 0.6
- dict.attribute1: 5
dict.attribute2: 0.8
- dict.attribute1: 6
dict.attribute2: 1
which, imho, is harder to read and maintain.
A middle ground could be something like this:
grid:
bundle:
dict.learning_rate:
type: uniform
...
dict.other_param:
type: choice
...
I guess my main thought is: What are the use cases for bundles? What is the most natural way of reflecting those use cases?
So I considered different model variants or datasets. In that case you have some parameters that are set to different values for each variant. And then it makes more sense to put all of those values in one place, instead of scattering them around your grid.
But that use case already seems to be covered well by our regular structure of sub-configs, so maybe you had something else in mind?
This notation does not permit the splitting of bundled parameters across the "global" configuration and the sub configs, e.g.:
grid:
dict:
attribute1:
type: choice
options: [1, 2, 3, 4, 5, 6]
bundle: bundle1
sub_config:
grid:
dict:
attribute2:
type: uniform
min: 0
max: 1
num: 6
bundle: bundle1
In my experiments, I often see myself commenting out parts of the configuration, adding them to the collection, commenting out other parts, adding them to the collection, etc. because sub-configurations are already occupied by datasets.
An example would be if I change the optimizer and learning rate. Since I can't use subconfigs as these are reserved for my datasets because they do not fit into the grid notation. So I have to comment in and out 2 lines to test 2 optimizers with different learning rates.
Here's an actual example:
fixed:
wfnet.ferminet_params.determinants: 32
wfnet.ferminet_params.hidden_dims:
[[512, 32, True], [512, 32, True], [512, 32, True], [512, 32, True]]
training.max_steps: 200000
training.patience: 10000
nitrogen:
fixed:
system:
name: diatomic
training:
collection_type: dynamic
n_configs: 16
config:
symbol1: N
symbol2: N
R:
lower: 1.4
upper: 6.5
std: 0.05
validation:
collection_type: static
n_configs: 16
total_samples: 1000000
config:
symbol1: N
symbol2: N
R:
- 1.60151
- 1.70828
- 1.81505
- 1.92181
- 2.02858
- 2.13535
- 2.24212
- 2.34889
- 2.45565
- 2.56242
- 2.66919
- 2.77595
- 2.88272
- 2.98949
- 3.09626
- 3.20302
- 3.30979
- 3.41656
- 3.52333
- 3.63009
- 3.73686
- 3.84363
- 3.95040
- 4.05716
- 4.16393
- 4.27070
- 4.37747
- 4.48423
- 4.59100
- 4.69777
- 4.80454
- 4.91130
- 5.01807
- 5.12484
- 5.23161
- 5.33837
- 5.87221
- 6.40605
cyclobutadiene:
fixed:
system:
name: cyclobutadiene
training:
collection_type: static
config:
state:
- ground
- transition
validation:
collection_type: static
total_samples: 1000000
config:
state:
- ground
- transition
If I want to change training.max_steps
and training.patience
jointly there is no way around commenting them and in out.
With bundled parameters I could do:
...
grid:
training.max_steps:
- type: choice
- options: [100000, 200000]
- bundle: convergence
training.patience:
- type: choice
- options: [5000, 10000]
- bundle: convergence
...
Currently, I could accomplish the same only by copying both sub-collection blocks with different training parameters.
So in this case it seems like you would really want the ability of combining sub-configs.
One way of giving more control over how to handle subconfigs would be something like this:
# This now just defines subconfigs, but we don't do anything with them.
nitrogen:
...
cyclobutadiene:
...
short_training:
...
long_training:
...
# Now the user themself has to define how to use those subconfigs
grid:
config:
- nitrogen
- cyclobutadiene
config:
- short_training
- long_training
other_attribute:
type: choice
options:
- True
- False
This still precludes the ability of using ranges etc, but I'm not really convinced that a range is what you want here in our use cases. Also, coupling ranges seems rather error-prone.
BUT, you can now define a grid in each subconfig, which would give you the power to do other things.
One problem with this would be that there isn't a clear hierarchy of configs anymore. So if you have duplicate parameters across subconfigs there's no clear way of resolving it.
That seems to be quite hard to get right because as you've said then there are suddenly subconfigs which only permit fixed
if you wanna use them in grid
.
I don't see a large downside to the implementation in this PR, its upsides are:
seml
as well in the actual config
files)choice
, uniform
and loguniform
iteration featuresMy main objection with introducing a new keyword or using grid
over configs is that it is much more error-prone and would require significantly more refactoring of the codebase. Catching all the ways one could mishandle these systems seems to be much harder than simply checking whether all parameters in a bundle have the same length.
The current way of implementing it is easier, absolutely. But it seems like a band-aid for a problem that would need a real, clean solution. The real solution certainly takes more effort. But then people don't have to use awkward definitions just because we were stuck in the current frame of thinking when implementing it. I really think we should take a step back and try to find a cleaner way for this.
I think we can improve the solution I've suggested above by keeping the subconfigs we already have, and adding the ability for the user to define parameter bundles, which they can later use in their grids etc. Bundles would work like I've suggested above and only contain fixed parameters.
Something like this:
# Define parameter groups/bundles, to use them later
def:
nitrogen:
...
cyclobutadiene:
...
short_training:
...
long_training:
...
model1:
...
model2:
...
# The user has to specify how to use the parameter groups/bundles
fixed:
model1: load
grid:
dataset: # The name doesn't do anything, but it has to be unique (due to yaml)
type: load
options:
- nitrogen
- cyclobutadiene
training:
type: load
options:
- short_training
- long_training
other_attribute:
type: choice
options:
- True
- False
We'd raise an error if bundles contain conflicting parameters (within and across). Implementation-wise this means that we simply have to exchange nitrogen
for whatever the user defined, before doing all the checks.
Hmm, but it might actually be nice to define full subconfigs (with fixed/grid/random), and then combine them at will later on. Currently, we assume that each subconfig is independent. So instead, we would now say that the user defines how the subconfigs are used or combined. Searching for conflicts might not be that hard, actually.
@klicperajo @danielzuegner what do we do with this PR? Should we merge it or simply keep it open?
Regarding naming, it's called zip
now.
@klicperajo @danielzuegner what do we do with this PR? Should we merge it or simply keep it open?
Regarding naming, it's called
zip
now.
I kind of liked the idea of a more "full-fledged" overhaul in order to support groups/bundles as in the later comment, but I'm not sure if anyone has time to implement it soon-ish. If not, I'm fine with going with zip
for now as it definitely adds useful features to seml
. Let me know what you think.
I'll have time to look into the larger overhaul in March. Would you have time to collaborate on this, @n-gao?
I'd prefer not to merge this until then.
Sure let's postpone the merge until we have a clearer roadmap on the larger overhaul. I am happy to collaborate on this.
Reference issue
At the moment, parameters in
grid
enable one to perform a complete grid search over all parameter ranges. However, in practice one may only change two parameters jointly.What does this implement/fix?
This PR adds functionality that enables one to create parameter bundles in
grid
. All parameters within azip
must have the same number of possible values and are only changed jointly. Here an example.yaml
file:The resulting configurations are:
Additional information
This PR also includes a test case for bundled parameters.