Closed dfuchsgruber closed 11 months ago
This is a really great addition to seml! A few high-level comments:
- Could we get documentation of the example in the readme.md in /examples?
Done.
- Do we want to change $named_config by default to $config for brevity?
I think we should change it to something not including any '$', since, as you mentioned later, this sometimes conflicts with MongoDB queries. Maybe something like '_NAMED_CONFIG' or '_CONFIG'?
- Are named configs higher or lower in priority than explicit overwrites? I'd argue that they should be lower?
Good point, intuitively they are listed with the priority as the key. I think that makes sense (at least to me) coming from the sacred CLI. One could also use an inverse ordering. Or we rename this field to order.
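To make the ordering under discussion concrete, here is a minimal sketch of the rule as the PR description states it: named configs are sorted by priority (lowest first), a missing priority counts as highest, ties are broken on the parameter group name, and explicit key-value pairs always win. The function names, group names, and file contents are hypothetical, and the merge is a shallow dict update rather than what sacred actually does.

```python
import math


def order_named_configs(groups):
    """Return named-config names in resolution order (hypothetical helper).

    `groups` maps parameter group name -> {"name": ..., "priority": optional}.
    """
    def sort_key(item):
        group_name, spec = item
        # A missing priority is treated as the highest, so it sorts last.
        return (spec.get("priority", math.inf), group_name)

    return [spec["name"] for _, spec in sorted(groups.items(), key=sort_key)]


def apply_in_order(named_values, explicit):
    """Later named configs override earlier ones; explicit values override all."""
    merged = {}
    for values in named_values:
        merged.update(values)  # shallow merge, for illustration only
    merged.update(explicit)
    return merged


groups = {
    "$named_config_b": {"name": "b.yaml"},
    "$named_config_a": {"name": "a.yaml", "priority": 2},
    "$named_config_c": {"name": "c.yaml", "priority": 1},
}
order = order_named_configs(groups)

# Made-up file contents standing in for what each named config would define.
contents = {
    "c.yaml": {"lr": 0.1, "optimizer": "sgd"},
    "a.yaml": {"lr": 0.01},
    "b.yaml": {"optimizer": "adam"},
}
final = apply_in_order([contents[name] for name in order], {"lr": 0.5})
```

With an inverse ordering or a field renamed to order, only `sort_key` would need to change.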
- With reload-sources, do we always start from the unresolved config, or only update non-existing keys? Should this behavior be controlled by a flag, or is always starting from the base config the most sensible way?
I think just re-resolving against new sources is the most straightforward and leads to the fewest side effects, as it is clear-cut what the command will do.
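As a rough illustration of "always re-resolve from the unresolved config", the sketch below models the source code as a plain dict of defaults and the resolution as defaults overlaid with the stored unresolved values. In seml the resolution goes through the sacred experiment, so `resolve`, `config_hash`, and the SHA-256-over-JSON hashing are assumptions made only for this example.

```python
import hashlib
import json


def resolve(unresolved, source_defaults):
    """Overlay the stored unresolved config on the defaults of the current source."""
    resolved = dict(source_defaults)
    resolved.update(unresolved)
    return resolved


def config_hash(config):
    """Order-independent hash of a resolved config (assumed scheme)."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# An old experiment was added before the source code knew an "optimizer" key.
old_unresolved = {"lr": 0.01}
old_defaults = {"lr": 0.001}
new_defaults = {"lr": 0.001, "optimizer": "adam"}

before_reload = resolve(old_unresolved, old_defaults)
after_reload = resolve(old_unresolved, new_defaults)

# A new experiment that spells out the optimizer explicitly.
new_experiment = resolve({"lr": 0.01, "optimizer": "adam"}, new_defaults)
is_duplicate = config_hash(after_reload) == config_hash(new_experiment)
```

Because the starting point is always the unresolved config, the result depends only on the stored input and the current source, not on whatever was resolved previously.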
- seml test add example_config.yaml fails with UnsupportedValueType: Value 'float64' is not a supported primitive type full_key: learning_rate object_type=dict
I guess this comes from omegaconf trying to resolve the numpy-generated numbers. We should probably convert them to floats? In the broader picture, this affects all numpy numbers in named configs in general, right? Also, we have to distinguish between ints and floats.
Done. I added value_to_primitive_datatype, which will recursively convert subclasses of str, float and int to the primitive datatype.
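For readers who want to see the idea, a recursive conversion of this kind could look roughly like the sketch below. It is not the value_to_primitive_datatype added in this PR; the name `to_primitive`, the handling of bool, and the container handling are assumptions. Small stand-in subclasses are used instead of numpy so the snippet runs without extra dependencies.

```python
def to_primitive(value):
    """Recursively replace subclasses of str, float and int by the plain builtin."""
    if isinstance(value, bool):
        # bool is a subclass of int; leave it alone so True does not become 1.
        return value
    for primitive in (str, float, int):
        if isinstance(value, primitive):
            return primitive(value)
    if isinstance(value, dict):
        return {to_primitive(k): to_primitive(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_primitive(v) for v in value]
    if isinstance(value, tuple):
        return tuple(to_primitive(v) for v in value)
    return value


# Stand-ins for scalar types that merely subclass the builtins.
class FancyFloat(float):
    pass


class FancyInt(int):
    pass


class FancyStr(str):
    pass


raw = {
    "learning_rate": FancyFloat(0.1),
    "layers": [FancyInt(2), 3],
    "optimizer": FancyStr("adam"),
    "use_bias": True,
}
clean = to_primitive(raw)
```

Checking int and float separately keeps integer-valued parameters from silently turning into floats.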
- Running pytest gives deprecation warnings: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('ruamel')`. Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages declare_namespace(pkg)
I am not sure what triggers this. I changed the pkg_resources usage to importlib, but as expected the warning persists. Is there a way to find out which module triggers it (i.e. which one declares ruamel)? Btw, I get a different deprecation warning:
../../../../staff-ssd/fuchsgru/miniconda3/envs/seml_pr/lib/python3.8/site-packages/pkg_resources/__init__.py:121
/nfs/staff-ssd/fuchsgru/miniconda3/envs/seml_pr/lib/python3.8/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)
- For a collection containing the advanced_example, the following query fails: seml advanced status -p config_unresolved.$named_config_preprocessing with OperationFailure: FieldPath must not end with a '.'., full error: {'operationTime': Timestamp(1689577796, 2), 'ok': 0.0, 'errmsg': "FieldPath must not end with a '.'.", 'code': 40353, 'codeName': 'Location40353', '$clusterTime': {'clusterTime': Timestamp(1689577796, 2), 'signature': {'hash': b'\x99\\\x10]\xd4\xd6\x1bI\xf0t\x0f\x89e\x15\xaa5\x18\xd5>\x14', 'keyId': 7223038808843878402}}}.
As mentioned above, this is because MongoDB tries to resolve every $ in a query. We should not use this character anywhere in MongoDB, so I'd argue for not using it in the named config prefix (see above).
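One way to guard against this class of problem would be a small check that flags configuration keys containing characters MongoDB treats specially in queries and field paths ('$' and '.'). The helper below is purely hypothetical and not part of seml; it only illustrates why a prefix such as '_NAMED_CONFIG' would be safer than '$named_config'.

```python
def find_unsafe_keys(config, path=()):
    """Return the key paths (as tuples) whose last key contains '$' or '.'."""
    unsafe = []
    for key, value in config.items():
        current = path + (key,)
        if "$" in key or "." in key:
            unsafe.append(current)
        if isinstance(value, dict):
            # Descend into nested parameter groups as well.
            unsafe.extend(find_unsafe_keys(value, current))
    return unsafe


config_unresolved = {
    "$named_config_preprocessing": {"name": "preprocessing.yaml"},
    "_NAMED_CONFIG_model": {"name": "model.yaml"},
    "optimizer": {"lr": 0.01},
}
offending = find_unsafe_keys(config_unresolved)
```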
I'll have to do some more testing later :)
This PR introduces two main features: Named configs and resolving configurations against sacred.
Named Configs
This enables the use of named configurations as supported in sacred. Named configs allow the user to define groups of configurations in several formats (python, yaml, json, ...) to make for a more composable configuration framework. When calling the sacred CLI, named configs would be passed as python experiment.py with key1=value1 key2=value2 ... named_config, where named_config can be a file or the name of an experiment.named_config decorated method. They will be executed in the order of their listing. Individual key-value pairs will always override configurations defined in named configs.
The functionality is recovered by defining a named config as a parameter group in the seml configuration file that uses the prefix Settings.NAMED_CONFIG_PREFIX, i.e. '$named_config'. This parameter group must define:
- name: The name of the named configuration. This is the string that would be appended to the sacred CLI, e.g. my_config.yaml, if you want to load a yaml configuration file.
- priority (optional): If multiple named configs are defined, they are sorted according to this priority (lowest first). Named configs with a low priority are resolved first and thus high priority configs will supersede definitions of previous ones. If not given, it will always be treated as highest priority. Ties are broken on the name of the parameter group.
Resolving configs
When adding experiments, seml configurations will be resolved against the sacred experiment and all resulting parameters are extracted. E.g., if the experiment code defines additional parameters not found in the seml configuration, they will be part of the resolved configuration. These resolved configurations will be the basis for experiment hashes and duplicate detection. The unresolved configurations will also be stored in a config_unresolved field in the MongoDB entry.
This resolution process also includes the aforementioned named configs: They will be resolved, and the config entry of the MongoDB document will also contain key-value pairs that are only obtained by running the named configs. E.g., if a named config defines a field param1 that the seml experiment configuration does not define, this field will now be listed in the MongoDB.
Upon running reload-sources, the previously stored unresolved configuration will be re-resolved against the new source code (and can therefore be used for duplicate detection again). The rationale is that researchers often define new parameters where previous experiments implicitly used one value of the parameter as a default. For example, imagine using the Adam optimizer by default and running a bunch of experiments that do not explicitly have a corresponding config key. Later, you decide that you also want to try a different optimizer and introduce a corresponding field. Reloading your previous experiments and resolving their config against the new source code will correctly put the default value (Adam) into the configuration for old experiments (and duplicate detection works without hacky workarounds or manually fumbling with the MongoDB).
The changes can be summarized as follows:
- add resolves configs (incl. named configs) against the source code: The corresponding key-value pairs will be the config of the experiment. Experiments will be run with these key-value pairs (this way named configs are realized without changing the CLI call of seml experiments)
- reload-sources will re-resolve configs against the new source code
- detect-duplicates will print duplicate groups in a collection
- status will also display duplicate entries
Omegaconf
We now additionally use omegaconf to enable variable interpolation in the experiment configurations. Variable interpolation is employed after named configs are resolved. This means, interpolated values are not available for the logic of named python configs.
Furthermore, variable interpolation is also added to the descriptions of experiments. It can be disabled by passing --no-resolve-descriptions to add and description set.
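To illustrate what variable interpolation means here, the following sketch resolves ${key.subkey} references inside an already merged configuration and inside a description string. It is a simplified stand-in written with the standard library only and is not how omegaconf implements interpolation; the function names and the example config are assumptions, and circular references are not handled.

```python
import re

# Matches references of the form ${some.dotted.key}.
_REFERENCE = re.compile(r"\$\{([^${}]+)\}")


def lookup(root, dotted_key):
    """Follow a dotted key such as 'model.hidden' through nested dicts."""
    node = root
    for part in dotted_key.split("."):
        node = node[part]
    return node


def interpolate(value, root):
    """Recursively replace ${...} references in `value` using `root`."""
    if isinstance(value, dict):
        return {k: interpolate(v, root) for k, v in value.items()}
    if isinstance(value, list):
        return [interpolate(v, root) for v in value]
    if isinstance(value, str):
        whole = _REFERENCE.fullmatch(value)
        if whole:
            # The string is a single reference: keep the referenced value's type.
            return interpolate(lookup(root, whole.group(1)), root)
        return _REFERENCE.sub(
            lambda m: str(interpolate(lookup(root, m.group(1)), root)), value
        )
    return value


# Config as it might look after named configs have been resolved.
config = {
    "model": {"hidden": 64},
    "data": {"dim": "${model.hidden}"},
}
resolved = interpolate(config, config)
description = interpolate("hidden=${model.hidden}, dim=${data.dim}", config)
```

Running the interpolation on the merged result mirrors the point above that interpolated values are not available to the logic of named python configs, which run earlier.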
Additional information
_SEML_COMPLETE=1 typer seml.__main__ utils docs --name seml --output docs.md
or did not change the CLI.