Open LuiggiTenorioK opened 4 months ago
In GitLab by @dbeltrankyl on Jul 16, 2024, 10:00
Hello @mandresm ,
Thanks for explaining the proposal and for trying to implement it! Very interesting.
I'm the one who wrote the Autosubmit Frankenstein dict, and I'll be on holiday from 22/07 to 05/08. If you or @Hussam-Turjman have any questions, I can answer them this week or after my holidays. @kinow reviewed it a long time ago, so you could also ask him.
Thanks
In GitLab by @kinow on Sep 6, 2024, 08:08
mentioned in merge request digital-twins/de_340-2/workflow!294
In GitLab by @mandresm on Sep 25, 2024, 09:30
mentioned in issue digital-twins/de_340-2/workflow#591
In GitLab by @kinow on Oct 9, 2024, 14:41
mentioned in commit 0b99076f26562fda2376204361b3fb78bd69fd53
unassigned @mandresm
In GitLab by @mandresm on Jul 11, 2024, 17:24
Summary
As mentioned in previous meetings, I want to propose that conf/metadata/experiment_data.yml contain information about the provenance of each value in the form of a comment. I am opening this issue to discuss the implementation strategy, timeline, responsibilities, and possible improvements to the feature.
An equivalent feature exists for ESM-Tools, an experiment configuration tool and workflow manager we develop at AWI. I propose we copy/paste from there and start modifying what we need. Here is an example of what I have in mind for the equivalent yaml file in ESM-Tools:
Who am I suggesting should implement this feature?
Either me or @Hussam-Turjman over the next one to two months, with support from someone from Autosubmit, for example @dbeltrankyl or @kinow. But if someone at BSC wants to get a head start, help yourself :)
What does this feature support?
The usual mutating methods (update, __setitem__, etc.) keep a history of the value's provenance, and there are other methods to recursively retrieve and set the provenance values. There is also a clean_provenance method to recursively return the original value and value type.
All of this won't only be useful for the comments in conf/metadata/experiment_data.yml, but also to query, at any point in Autosubmit, the provenance of a given value, simply by using the provenance attribute of that particular value: my_dict["my_key"].provenance in a dict and my_list[my_index].provenance in a list. It could also come in pretty handy for improving error messages.
Can we reuse (copy/paste) the code from ESM-Tools?
Yes, our license is GPL-2: https://github.com/esm-tools/esm_tools?tab=GPL-2.0-1-ov-file#readme
Relevant files in ESM-Tools
How can it be implemented?
During the parsing of the yaml one needs to extract the line and column information somehow and store it in a collection that has the same structure as the collection loaded from the yaml. We do that with the EsmToolsLoader, a subclass of ruamel.yaml.YAML: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L693-L770
Note that EsmToolsLoader has some deprecated methods related to dumping. The most important method there is load.
That uses this constructor class, subclassed from ruamel.yaml.RoundTripRepresenter: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L638-L673
Note that there we are subclassing from EnvironmentConstructor, whose parent class is ruamel.yaml.RoundTripRepresenter. For the implementation here we could directly subclass from ruamel.yaml.RoundTripRepresenter.
Once the code is implemented, one can simply do as in these lines: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L188-L198
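To illustrate the idea of loading a file into two collections with the same keys (the values and their provenance), here is a rough, library-free sketch. It is not the EsmToolsLoader: load_with_provenance is a hypothetical helper, it only understands flat "key: value" lines rather than real YAML, and the keys of the provenance records (yaml_file, line, col) are assumptions chosen for illustration.

```python
from typing import Dict, Tuple


def load_with_provenance(text: str, source: str) -> Tuple[Dict[str, str], Dict[str, dict]]:
    """Toy loader for flat ``key: value`` lines (NOT real YAML).

    Returns two dicts with identical keys: the values, and for each
    value a record of the file, line and column where it was found.
    """
    values: Dict[str, str] = {}
    provenance: Dict[str, dict] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or ":" not in line:
            continue  # skip blanks, comments and anything we do not understand
        key, _, raw_value = line.partition(":")
        # 1-based column of the first character of the value.
        col = len(line) - len(raw_value.lstrip()) + 1
        values[key.strip()] = raw_value.strip()
        provenance[key.strip()] = {"yaml_file": source, "line": line_no, "col": col}
    return values, provenance


# Hypothetical file content and path, for illustration only.
yaml_load, provenance = load_with_provenance("model: ifs\nnodes:   4\n", "conf/example.yml")
print(yaml_load)
print(provenance)
```

In the real implementation the line and column information would come from the ruamel.yaml parser rather than from manual string handling.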
After this you are going to have your standard collection as read from ruamel.yaml in yaml_load, and the provenance, another collection with the same structure as yaml_load in terms of keys, but whose values contain provenance objects instead.
Join the two worlds in one single collection; for example, for a dictionary use the class DictWithProvenance. This dict now has all the provenance information attached to its values and you can use it as you wish. If your collection is a list you can choose to use ListWithProvenance instead of DictWithProvenance. tuples, sets and others are not supported.
For all the methods related to provenance see provenance.py itself. It's almost more docstrings than code: https://github.com/esm-tools/esm_tools/blob/release/src/esm_parser/provenance.py
You can now operate with the lists and dictionaries as you would usually do. As long as you are using __setitem__ (or update in the case of dictionaries) you keep the provenance history in the provenance attribute of the value; the last entry of the provenance is the actual provenance of its current value.
Time to dump the Frankenstein dictionary we've been putting together from pieces of other yamls, using the function yaml_dump: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/dict_to_yaml.py#L11-L130
It's not a very elegant or efficient function, but it does the job, I guess...
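The last three steps (joining values and provenance, keeping a history on assignment, and dumping with comments) can be sketched in a few lines. This is only a simplified illustration under stated assumptions, not the ESM-Tools code: Tracked, TrackedDict and dump_with_comments are hypothetical names, the sketch wraps values in a plain container instead of attaching a provenance attribute to the values themselves, and the comment format is made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Tracked:
    """A value plus the history of where each of its assignments came from."""
    value: Any
    provenance: List[Optional[dict]] = field(default_factory=list)


class TrackedDict(dict):
    """Hypothetical dict that keeps a provenance history for every key."""

    def __init__(self, values: Dict[str, Any], provenance: Dict[str, dict]):
        super().__init__()
        for key, value in values.items():
            dict.__setitem__(self, key, Tracked(value, [provenance.get(key)]))

    def __setitem__(self, key, new):
        # Accept either a bare value (unknown origin) or a Tracked one.
        if not isinstance(new, Tracked):
            new = Tracked(new, [None])
        old = self.get(key)
        history = old.provenance if old is not None else []
        # Older entries first; the last entry describes the current value.
        dict.__setitem__(self, key, Tracked(new.value, history + new.provenance[-1:]))

    def update(self, other):
        for key, value in other.items():
            self[key] = value


def dump_with_comments(data: TrackedDict) -> str:
    """Render flat ``key: value`` lines with the current provenance as a comment."""
    lines = []
    for key, item in data.items():
        prov = item.provenance[-1]
        origin = f"{prov['yaml_file']},line:{prov['line']},col:{prov['col']}" if prov else "no provenance"
        lines.append(f"{key}: {item.value}  # {origin}")
    return "\n".join(lines) + "\n"


# Two hypothetical yaml files setting the same key; b.yml overrides a.yml.
base = TrackedDict({"nodes": 4}, {"nodes": {"yaml_file": "a.yml", "line": 2, "col": 8}})
override = TrackedDict({"nodes": 8}, {"nodes": {"yaml_file": "b.yml", "line": 5, "col": 8}})
base.update(override)
print(dump_with_comments(base))
```

After the update, the history of "nodes" holds the a.yml entry followed by the b.yml entry, and the dumped comment shows only the last one, which is the provenance of the current value.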