BSC-ES / autosubmit-config-parser

Library used to read Autosubmit 4 experiment data.

Yaml provenance #11

Open LuiggiTenorioK opened 4 months ago

LuiggiTenorioK commented 4 months ago

In GitLab by @mandresm on Jul 11, 2024, 17:24

Summary

As mentioned in previous meetings, I propose that conf/metadata/experiment_data.yml record the provenance of each value as a trailing comment. I am opening this issue to discuss the implementation strategy, timeline, responsibilities, and possible improvements to the feature.

An equivalent feature exists in ESM-Tools, an experiment configuration tool and workflow manager we develop at AWI. I propose we copy the code from there and modify what we need. Here is an example of what I have in mind, taken from the equivalent YAML file in ESM-Tools:

fesom:
  model: fesom  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:4,col:8
  branch: 2.0.2 # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:17,col:13
  version: 2 # <SOME_ABSOLUTE_PATH>/esm_tools/configs/setups/awicm3/awicm3.yaml,line:399,col:18
  type: ocean # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:7,col:7
  comp_command: mkdir -p build; cd build; cmake -DOIFS_COUPLED=ON -DFESOM_COUPLED=ON -DCMAKE_INSTALL_PREFIX=../ ..;   make install -j `nproc --all` # <SOME_ABSOLUTE_PATH>/esm_tools/configs/setups/awicm3/awicm3.yaml,line:414,col:31
  clean_command: rm -rf build CMakeCache.txt # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:10,col:16
  required_plugins:
  - git+https://github.com/esm-tools-plugins/tar_binary_restarts  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:13,col:3
  install_bins: bin/fesom.x  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:22,col:19
  git-repository:
  - https://github.com/FESOM/fesom2.git  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:20,col:7
  - https://gitlab.dkrz.de/FESOM/fesom2.git # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:21,col:7
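
A minimal sketch of how comment lines in that shape could be rendered, assuming each value's provenance is available as a plain dict with `yaml_file`, `line` and `col` keys. The function name `render_with_provenance` and that record layout are made up for illustration and are not the ESM-Tools API; the sketch only handles a flat mapping of scalars.

```python
def render_with_provenance(values, provenance):
    """Render a flat mapping as YAML-like lines with a trailing provenance comment.

    Both arguments are hypothetical plain dicts sharing the same keys.
    """
    lines = []
    for key, value in values.items():
        line = f"{key}: {value}"
        prov = provenance.get(key)
        if prov is not None:
            # Same "<file>,line:<n>,col:<n>" shape as the comments in the example above.
            line += f"  # {prov['yaml_file']},line:{prov['line']},col:{prov['col']}"
        lines.append(line)
    return "\n".join(lines)


values = {"model": "fesom", "type": "ocean"}
provenance = {
    "model": {"yaml_file": "fesom-2.0.yaml", "line": 4, "col": 8},
    "type": {"yaml_file": "fesom-2.0.yaml", "line": 7, "col": 7},
}
rendered = render_with_provenance(values, provenance)
print(rendered)
```

Nested mappings and lists would need a recursive walk, which is left out here to keep the idea visible.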

Who am I suggesting that implements this feature?

Either me or @Hussam-Turjman over the next one to two months, with support from someone on the Autosubmit side, for example @dbeltrankyl or @kinow. But if someone at BSC wants a head start, help yourself :)

What does this feature support?

This will be useful not only for the comments in conf/metadata/experiment_data.yml, but also for querying the provenance of a given value at any point in Autosubmit, simply by using the provenance attribute of that value: my_dict["my_key"].provenance for a dict and my_list[my_index].provenance for a list. It could also come in handy for improving error messages.
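
To make the attribute access concrete, here is a small sketch, assuming values are wrapped in a `str` subclass that carries a `provenance` attribute. `StrWithProvenance`, `describe` and the record layout are hypothetical names for illustration only; the real ESM-Tools classes live in provenance.py and may look different.

```python
class StrWithProvenance(str):
    """A string that remembers where its value was defined (hypothetical helper)."""

    def __new__(cls, value, provenance=None):
        obj = super().__new__(cls, value)
        obj.provenance = provenance
        return obj


def describe(key, value):
    """Build an error-message fragment that points at the defining file, if known."""
    prov = getattr(value, "provenance", None)
    if prov is None:
        return f"{key}={value!s} (provenance unknown)"
    return (
        f"{key}={value!s} "
        f"(defined in {prov['yaml_file']}, line {prov['line']}, col {prov['col']})"
    )


record = {"yaml_file": "fesom-2.0.yaml", "line": 7, "col": 7}
my_dict = {"my_key": StrWithProvenance("ocean", record)}
my_list = [StrWithProvenance("ocean", record)]

# The wrapped value still behaves like a normal string.
assert my_dict["my_key"] == "ocean"

message = describe("my_key", my_dict["my_key"])
print(message)
print(my_list[0].provenance)
```

Because the wrapper is still a `str`, existing code that consumes the configuration should not need to change, while error paths can read `.provenance` when it is present.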

Can we reuse (copy/paste) the code from ESM-Tools?

Yes, our license is GPL-2: https://github.com/esm-tools/esm_tools?tab=GPL-2.0-1-ov-file#readme

Relevant files in ESM-Tools

How can it be implemented?

  1. While parsing the YAML, one needs to extract the line and column information and store it in a collection that has the same structure as the collection loaded from the YAML. We do that with EsmToolsLoader, a subclass of ruamel.yaml.YAML: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L693-L770 Note that EsmToolsLoader has some deprecated methods related to dumping. The most important method there is load.

    It uses this constructor class, subclassed from ruamel.yaml.RoundTripConstructor: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L638-L673

    Note that there we subclass from EnvironmentConstructor, whose parent class is ruamel.yaml.RoundTripConstructor. For the implementation here we could subclass directly from ruamel.yaml.RoundTripConstructor.

    Once the code is implemented one can simply do:

    esm_tools_loader = EsmToolsLoader()
    esm_tools_loader.set_filename(yaml_file)
    yaml_load, provenance = esm_tools_loader.load(yaml_file)

    as in these lines: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L188-L198

    After this, yaml_load holds your standard collection as read by ruamel.yaml, and provenance holds another collection with the same structure as yaml_load in terms of keys, but whose values are provenance objects instead.

  2. Join the two worlds in a single collection. For example, for a dictionary use the class DictWithProvenance:

    dictionary_with_provenance = DictWithProvenance(yaml_load, provenance)

    This dict now has all the provenance information attached to its values, and you can use it as you wish. If your collection is a list, use ListWithProvenance instead of DictWithProvenance. Tuples, sets and other collection types are not supported.

    For all the methods related to provenance, see provenance.py itself. It's almost more docstrings than code: https://github.com/esm-tools/esm_tools/blob/release/src/esm_parser/provenance.py

  3. You can now operate on the lists and dictionaries as you usually would. As long as you use __setitem__ (or update, in the case of dictionaries), the provenance history is kept in the provenance attribute of the value; the last entry of that history is the provenance of the current value:

    my_list_with_prov[2] = my_var_with_prov
    previous_provenance = my_list_with_prov[2].provenance[-2]
    latest_provenance = my_list_with_prov[2].provenance[-1]

  4. Time to dump the Frankenstein dictionary we've been putting together from pieces of other YAMLs, using the function yaml_dump: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/dict_to_yaml.py#L11-L130

    yaml_dump(your_dict_or_list_with_prov, "/path/to/the/commented.yaml")

    It's not a very elegant or efficient function, but it does the job, I guess...
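
Steps 2 and 3 can be sketched roughly as follows. This is a standard-library toy, not the ESM-Tools code: `ValueWithProvenance`, `SketchDictWithProvenance` and the record layout are hypothetical, and the two input dicts stand in for the yaml_load and provenance collections returned by the loader in step 1.

```python
class ValueWithProvenance:
    """Pairs a value with the history of its provenance records (hypothetical)."""

    def __init__(self, value, history):
        self.value = value
        self.provenance = list(history)


class SketchDictWithProvenance(dict):
    """Toy stand-in for a provenance-aware dict; only __setitem__ is overridden."""

    def __init__(self, values, provenance):
        super().__init__()
        # Step 2: join the two parallel collections key by key.
        for key, value in values.items():
            record = provenance.get(key)
            dict.__setitem__(self, key, ValueWithProvenance(value, [record]))

    def __setitem__(self, key, new_value):
        # Step 3: keep the old history and append the incoming provenance.
        old = self.get(key)
        old_history = old.provenance if old is not None else []
        new_history = getattr(new_value, "provenance", [None])
        raw = getattr(new_value, "value", new_value)
        dict.__setitem__(self, key, ValueWithProvenance(raw, old_history + new_history))


# Stand-ins for the two collections produced by the loader.
yaml_load = {"version": "2.0"}
provenance = {"version": {"yaml_file": "fesom-2.0.yaml", "line": 17, "col": 13}}

config = SketchDictWithProvenance(yaml_load, provenance)

# A value coming from another file overrides the first one.
override = ValueWithProvenance(
    "2", [{"yaml_file": "awicm3.yaml", "line": 399, "col": 18}]
)
config["version"] = override

previous_provenance = config["version"].provenance[-2]
latest_provenance = config["version"].provenance[-1]
print(previous_provenance["yaml_file"], "->", latest_provenance["yaml_file"])
```

One thing to watch when porting this idea: on a `dict` subclass, the built-in `update` does not go through an overridden `__setitem__`, so a real implementation has to override `update` explicitly if it should also preserve the history.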

LuiggiTenorioK commented 4 months ago

In GitLab by @dbeltrankyl on Jul 16, 2024, 10:00

Hello @mandresm,

Thanks for explaining the proposal and for trying to implement it! Very interesting.

I'm the one who wrote the Autosubmit Frankenstein dict, and I'll be on holiday from 22/07 to 05/08. If you or @Hussam-Turjman have any questions, I can answer them this week or after my holidays. @kinow also reviewed it a long time ago, so maybe you can ask him as well.

Thanks

LuiggiTenorioK commented 2 months ago

In GitLab by @kinow on Sep 6, 2024, 08:08

mentioned in merge request digital-twins/de_340-2/workflow!294

LuiggiTenorioK commented 2 months ago

In GitLab by @mandresm on Sep 25, 2024, 09:30

mentioned in issue digital-twins/de_340-2/workflow#591

LuiggiTenorioK commented 1 month ago

In GitLab by @kinow on Oct 9, 2024, 14:41

mentioned in commit 0b99076f26562fda2376204361b3fb78bd69fd53

LuiggiTenorioK commented 1 month ago

unassigned @mandresm