TUM-DAML / seml

SEML: Slurm Experiment Management Library
Other
165 stars 29 forks source link

Named configs and config resolution #119

Closed dfuchsgruber closed 11 months ago

dfuchsgruber commented 1 year ago
Named configs and config resolution ### What does this implement/fix?

This PR introduces two main features: Named configs and resolving configurations against sacred.

Named Configs

This enables the use of named configurations as supported in sacred. Named configs allow the user to define groups of configurations in several formats (python, yaml, json, ...) to make for a more composable configuration framework. When calling the sacred CLI, named configs would be passed as python experiment.py with key1=value1 key2=value2 ... named_config, where named config can be a file or the name of a experiment.named_config decorated method. They will be executed in the order of their listing. Individual key-value pairs will be always override configurations defined in named configs.

The functionality is recovered by defining a named config as a parameter group in the seml configuration file that uses the prefix Settings.NAMED_CONFIG_PREFIX, i.e. '$named_config'. This parameter group must define:

Resolving configs

When adding experiments, seml configurations will be resolved against the sacred experiment and all resulting parameters are extracted. E.g., if the experiment code defines additional parameters not found in the seml configuration, they will be part of the resolved configuration. These resolved configurations will be the basis for experiment hashes and duplicate detection. The unresolved configurations will also be stored as in a config_unresolved field in the MongoDB entry.

This resolution process also includes aforementioned named configs: They will be resolved and the config entry of the MongoDB document will have key-value pairs only that may also be extracted from running all the named configs. E.g., if in a named config a field param1 is defined that the seml experiment configuration does not define, this field will now be listed in the MongoDB.

Upon running reload-sources, the previously stored unresolved configuration will be re-resolved against the new source code (and therefore can be used for duplicate detection again). The rationale behind is that researchers often define new parameters where previous experiments implicitly used one value of the parameter as a default value: Example, imagine using the Adam optimizer by default and running a bunch of experiments that do not explicitly have a corresponding config key. Later, you decide that you also want to try a different optimizer and introduce a corresponding field. Reloading your previous experiments and resolving their config against the new source code will correctly put the default value (Adam) into the configuration for old experiments (and duplicate detection works without hacky workarounds or manually fumbling with the MongoDB).

The changes can be summarized as follows:

Omegaconf

We now additionally use omegaconf to enable variable interpolation in the experiment configurations. Variable interpolation is employed after named configs are resolved. This means, interpolated values are not available for the logic of named python configs.

Furthermore, variable interpolation is also added to the descriptions of experiments. It can be disabled with the --no-resolve-descriptions to add and description set

Additional information

dfuchsgruber commented 1 year ago

This is a really great addition to seml! A few high-level comments:

  1. Could we get documentation of the example in the readme.md in /examples?

Done.

  1. Do we want to change $named_config by default to $config for briefity?

I think we should change it to something not including any '$', since, as you mentioned later, this conflicts with MongoDB queries sometimes. Maybe something like '_NAMED_CONFIG' or '_CONFIG'?

  1. are named configs higher or lower in priority than explicit overwrites? I'd argue that they should be lower?

Good point, intuitively they are listed with the priority as the key. I think that makes sense (at least to me) coming from the sacred CLI. One could also use an inverse ordering. Or we rename this field to order.

  1. Do we with reload-sources always start from the unresolved one or only update non-existing keys? Should this behavior be handled by a flag or is always starting from the base config the most sensible way?

I think just re-resolving against new sources is the most straightforward and leads to the least amount of side-effects, as it is clear cut what the command will do.

  1. seml test add example_config.yaml fails with
UnsupportedValueType: Value 'float64' is not a supported primitive type
    full_key: learning_rate
    object_type=dict

I guess this comes from omegaconf trying to resolve the numpy generated numbers. We should probably convert them to floats? In a broader picture this affects all numpy numbers in named configs in general, right? Also, we have to distinguish between int and floats.

Done. I added value_to_primitive_datatype which will recursively convert subclasses of str, float and int to the primitive datatype.

  1. Running pytest gives deprecation warnings. DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('ruamel')`. Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages declare_namespace(pkg)

I am not sure what triggers this. I changed pkg_resources usage to importlib, but as expected the error prevails. Is there a way to find out which module triggers it (i.e. which one declares ruamel)? Btw, I get a different deprecation warning:

../../../../staff-ssd/fuchsgru/miniconda3/envs/seml_pr/lib/python3.8/site-packages/pkg_resources/__init__.py:121
  /nfs/staff-ssd/fuchsgru/miniconda3/envs/seml_pr/lib/python3.8/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
    warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)
  1. For a collection containing the advanced_example the following query fails: seml advanced status -p config_unresolved.$named_config_preprocessing with OperationFailure: FieldPath must not end with a '.'., full error: {'operationTime': Timestamp(1689577796, 2), 'ok': 0.0, 'errmsg': "FieldPath must not end with a '.'.", 'code': 40353, 'codeName': 'Location40353', '$clusterTime': {'clusterTime': Timestamp(1689577796, 2), 'signature': {'hash': b'\x99\\\x10]\xd4\xd6\x1bI\xf0t\x0f\x89e\x15\xaa5\x18\xd5>\x14', 'keyId': 7223038808843878402}}}.

As mentioned above, this is because in a query MongoDB wants to resolve all $. We should not use these characters anywhere in MongoDB, so I'd argue for not using them in the named config prefix (see above).

I'll have to do some more testing later :)