NOAA-EMC / RDASApp


yaml-tools - easily examine, compare, manipulate, and breakdown/assemble YAML files #187

Open guoqing-noaa opened 1 week ago

guoqing-noaa commented 1 week ago

yaml-tools

YAML is one of the core components of the JEDI system. Efficiently handling YAML files is crucial for utilizing JEDI in both research and operational development. We need simple, intuitive, and user-friendly YAML tools to help scientists easily examine, compare, manipulate, and breakdown/assemble YAML files.

Python offers the PyYAML module, which gives developers fine-grained control over YAML files. However, it comes with a learning curve and requires coding and debugging.

On the other hand, yq is a lightweight and portable command-line YAML, JSON, and XML processor. While useful, it lacks a few key features that are essential for manipulating JEDI YAML files:

  1. JEDI YAML files often include spaces in key names, such as cost function, but currently, as far as I know, yq does not support spaces in key names (see the short PyYAML snippet after this list).
  2. yq does not provide a quick way to view top-level keys at the current nesting level.
  3. yq does not support traversing a YAML file to output a tree structure of its keys.
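
On the first point, PyYAML-based tools handle spaced keys naturally, since a key such as cost function becomes an ordinary dictionary key after parsing. A tiny illustration (the YAML snippet here is a made-up fragment, not a real JEDI file):

import yaml

snippet = """
cost function:
  cost type: 3DVar
"""

doc = yaml.safe_load(snippet)
# Keys with spaces are ordinary dictionary keys after parsing.
print(doc["cost function"]["cost type"])   # -> 3DVar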

A PyYAML-based yaml-tools repository has been developed to address the above limitations. The repo includes the following utilities:

1. ycheck

This script simply loads a YAML file and dumps its data to stdout. If the YAML file contains non-standard elements, it halts and prints detailed error information.
ycheck sample.yaml
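
Under the hood this kind of check is essentially a load-and-dump round trip through PyYAML. Below is a minimal sketch of the idea, not the actual ycheck code (the function name ycheck_sketch is made up for illustration):

import sys
import yaml  # PyYAML

def ycheck_sketch(yamlfile):
    # safe_load raises a ScannerError/ParserError with file, line, and column
    # information whenever the file contains non-standard elements.
    with open(yamlfile) as f:
        data = yaml.safe_load(f)
    # If loading succeeded, dump the parsed data back to stdout for inspection.
    yaml.safe_dump(data, sys.stdout, default_flow_style=False, sort_keys=False)

if __name__ == "__main__":
    ycheck_sketch(sys.argv[1])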

2. yquery

This script queries a given element using a query string.

yquery sample.yaml ["key1/key2/0"] [shallow|traverse|dump|changeto=""]
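
The query string is a slash-separated path, so key names containing spaces (e.g. cost function) and numeric list indices both fit naturally. The following is a rough PyYAML sketch of that kind of navigation, not the actual yquery implementation; sample.yaml and the query path are just the placeholders from the usage line above:

import yaml

def resolve(data, query):
    # Walk the nested structure one path component at a time.
    node = data
    for part in query.split("/"):
        if isinstance(node, list):
            node = node[int(part)]   # numeric components index into lists
        else:
            node = node[part]        # mapping keys may contain spaces, e.g. "cost function"
    return node

with open("sample.yaml") as f:
    doc = yaml.safe_load(f)
print(resolve(doc, "key1/key2/0"))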

Mini Tutorial

This repository assumes that the current Python environment has the PyYAML module installed.
On NOAA RDHPCS, PyYAML is available in the RDASApp EVA Python environment.

git clone https://github.com/NOAA-EMC/RDASApp
cd RDASApp
source ush/load_eva.sh
git clone https://github.com/rrfsx/yaml-tools.git
cd yaml-tools

1. ycheck

./ycheck samples/raw.mpasjedi_en3dvar.yaml

You will get the following error message:

...
  File "/lfs5/BMC/wrfruc/gge/miniconda3/4.6.14/envs/eva/lib/python3.9/site-packages/yaml/scanner", line 258, in fetch_more_tokens
    raise ScannerError("while scanning for the next token", None,
yaml.scanner.ScannerError: while scanning for the next token
found character '%' that cannot start any token
  in "samples/raw.mpasjedi_en3dvar.yaml", line 96, column 16

You can diff this file against samples/mpasjedi_en3dvar.yaml to see which changes fix this error.

2. yquery

./yquery samples/rrfs_mpasjedi_2024052700_Ens3Dvar.yaml
./yquery samples/rrfs_mpasjedi_2024052700_Ens3Dvar.yaml "cost function"
./yquery samples/rrfs_mpasjedi_2024052700_Ens3Dvar.yaml "cost function/background"
./yquery samples/rrfs_mpasjedi_2024052700_Ens3Dvar.yaml "cost function/background" traverse
./yquery samples/rrfs_mpasjedi_2024052700_Ens3Dvar.yaml "cost function/background" dump > background.yaml
vi background.yaml
./yquery samples/rrfs_mpasjedi_2024052700_Ens3Dvar.yaml "cost function/observations/observers/0/obs filters/0/action" traverse
./yquery samples/rrfs_mpasjedi_2024052700_Ens3Dvar.yaml "cost function/observations/observers/0/obs filters/0/action" dump

3. ybreakdown

./ybreakdown samples/rrfs_mpasjedi_2024052700_Ens3Dvar.yaml
cd "rrfs_mpasjedi_2024052700_Ens3Dvar.yaml/cost function/observations/observers"

Under the observers subdirectory, you will see 16 observers, and you can compare the configurations of different observers.
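
From the cd command above, ybreakdown appears to mirror the YAML nesting as a directory tree. Here is a rough PyYAML sketch of that idea, under the assumption that mappings become directories, list items become numbered entries, and leaf values are written as small YAML files; it is illustrative only, not the actual tool:

import os
import yaml

def breakdown(node, path):
    # Recursively mirror the YAML structure on disk.
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        # Leaf value: write it out as a small YAML file.
        with open(path + ".yaml", "w") as f:
            yaml.safe_dump(node, f)
        return
    os.makedirs(path, exist_ok=True)
    for key, value in items:
        breakdown(value, os.path.join(path, str(key)))

with open("samples/rrfs_mpasjedi_2024052700_Ens3Dvar.yaml") as f:
    data = yaml.safe_load(f)
breakdown(data, "rrfs_mpasjedi_2024052700_Ens3Dvar.yaml")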

guoqing-noaa commented 1 week ago

yaml-tools was motivated by my struggles navigating the giant combined ctest YAML files under rrfs-test/testinput from PR #184. I found that these tools greatly help me understand the nesting structure of the YAML files. I'm happy to offer a mini-tutorial (5-15 minutes) for RDASApp developers and users.

SamuelDegelia-NOAA commented 6 days ago

@guoqing-noaa These yaml tools will be useful when looking through some of the giant combined yamls. We are having ongoing discussions about how best to save these files in the repo, since the giant yamls can be very hard to review or edit. The best strategy for now is probably to focus on reviewing the individual yaml files in rrfs-test/validated_yamls/templates/basic_config and rrfs-test/validated_yamls/templates/obtype_config. But at least for #184, we also need to make sure that the catting method is working.

ShunLiu-NOAA commented 6 days ago

@guoqing-noaa, @SamuelDegelia-NOAA and @hu5970 These yaml tools are good. Let's discuss how to implement these tools in RDASApp.

guoqing-noaa commented 5 days ago

@ShunLiu-NOAA @delippi @SamuelDegelia-NOAA @hu5970 Thanks for lots of discussions today!

To clarify, today I only presented YAML tools for developers/users to easily navigate and manipulate YAML files.

How to manage YAML files in RDASApp and rrfs-workflow is another topic; we can have a small group meeting to discuss it further.

A few thoughts I have now:

  1. In the rrfs-workflow, DA monitoring will definitely check the final giant YAML directly used by JEDI rather than any intermediate step. Even though the intermediate files may contain the same contents, in practice this is NOT always the case, since unexpected events can happen in operations or real time.

  2. Any RDASApp YAML file should be parseable without errors by standard YAML tools.

  3. Any RDASApp YAML file should be directly usable by JEDI without much work. We may combine different YAML files together, but each individual YAML file should also work out of the box.

SamuelDegelia-NOAA commented 5 days ago

Thank you for the presentation today @guoqing-noaa. I think this will probably be a point of discussion for us for a while going forward. I agree that we should make sure the yaml files in RDASApp can be read by the yaml tools (issue #186). Those tools are useful for parsing through very large files like we want to use for the ctests. I will update and test the yaml files in #184 for this.

Regarding your point 3, this would mean that the yaml files in rrfs-test/validated_yamls/templates/obtype_configs would need to have many sections added. However, we are using those obtype yamls for both EnVar and GETKF tests which require different settings. So if we wanted to implement point 3, we would need to create multiple copies of them, making future development more tedious.

I think our current way of doing things using gen_yaml.sh is pretty clean because everything is split into its individual components. This way, you can easily create a yaml file ready for DA using whichever ob types or DA method you want. And if you want to make edits, you only have to edit the file specific to the observation or DA method of your choosing.

guoqing-noaa commented 5 days ago

@SamuelDegelia-NOAA That's a good point!

Previously, in the GSI world, the EnKF got the OMB from GSI, so we did not need to run the observer twice with the same obs configuration. Do we know why we cannot do something similar in JEDI? Sorry, I am not familiar with this part.

SamuelDegelia-NOAA commented 5 days ago

For the GETKF, the observer needs to be run on the modulated and real ensemble members. This means we cannot use the hofx files produced by the EnVar step and instead need to run the observer separately for GETKF (the hofx files will be much larger). JEDI also expects different settings for the GETKF yaml file (i.e., the driver and local ensemble DA sections) such that we cannot reuse yamls from EnVar. As such, we end up needing separate yaml files for both EnVar and GETKF (solver and observer components).

guoqing-noaa commented 4 days ago

@SamuelDegelia-NOAA Thanks again for your good point about reusing the observations object for both Var and EnKF yaml files. I think I have an alternate solution to this and I will show it here. Tag @ShunLiu-NOAA @hu5970 @delippi for awareness

I propose we make every effort to ensure that each YAML file can be used out of the box without much extra effort, and that we avoid manually adding spaces to align the observations object across YAML files at different nesting levels.

========
The general idea is that we edit the observations section in the VAR YAML files and make sure everything is right there first. Then we dump this observations object from the VAR YAML file into the EnKF YAML files using yaml-tools. Note: both the VAR YAML files and the EnKF YAML files here can be used directly, out of the box.

Here is a demo with the 131 observation:

  1. load the EVA Python environment:

    source $RDASApp/ush/load_eva.sh
  2. clone yaml-tools and module load it:

    git clone git@github.com:rrfsx/yaml-tools
    module use yaml-tools/modulefiles
    module load yaml-tools
    cd yaml-tools/samples/131
  3. compare the observations objects from ens3dvar.yaml and observer.yaml

    yquery ens3dvar.yaml "cost function/observations" dump > txt.ens3dvar
    yquery observer.yaml "observations" dump > txt.observer
    diff txt.ens3dvar txt.observer

    we can see they are the same.

  4. Now let's modify the observations object in ens3dvar.yaml. You could modify ens3dvar.yaml manually,
    but for this demo, let's use yquery to edit one leaf value:

    yquery ens3dvar.yaml "cost function/observations/observers/0/obs filters/3/where/0/is_in" edit=133 > newens3dvar.yaml
    diff ens3dvar.yaml newens3dvar.yaml
  5. Dump the new observations object from newens3dvar.yaml into observer.yaml

    yquery newens3dvar.yaml "cost function" dump | yquery observer.yaml "observations" edit=pipe > newobserver.yaml
    diff observer.yaml newobserver.yaml

    We can see that newobserver.yaml now has the expected observations change.
    (Ignore the cosmetic formatting changes in increment variables; they do not matter.)

  6. More: yquery now supports traversing, dumping, editing, and deleting a key from a dictionary or an item from a list, as well as appending a new key or item. Let me know if you have any questions.

Also, to clarify: these tools are meant to help developers by streamlining things and reducing tedious manual editing, not to add burdens. There is no intention to mandate that anyone use them. If one can achieve similar capabilities with other tools or methods, that is great.

SamuelDegelia-NOAA commented 4 days ago

Thank you @guoqing-noaa for the helpful example! We will give some thought to this.

guoqing-noaa commented 4 days ago

A heads up: merging different VAR YAML files into one is expected to be much easier than the yquery tool, and I will get it done today.

guoqing-noaa commented 4 days ago

I added more documentation:

  1. User Guide: https://github.com/rrfsx/yaml-tools/wiki/yaml%E2%80%90tools-user-guide

  2. Tutorial https://github.com/rrfsx/yaml-tools/wiki/YAML%E2%80%90TOOLS-tutorial

For the merge capability, here are some examples (taken from the above tutorial):

ymergeList mini01.yaml "demo/configuration/detail" mini02.yaml mini03.yaml mini04.yaml
ymergeList 131var.yaml "cost function/observations/observers" 233var.yaml 188var.yaml > combine.yaml
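
For context, here is a rough PyYAML sketch of the list-merge semantics I read from the examples above (an assumption, not the actual ymergeList code): the list found at the query path in the first file is extended with the corresponding items from the other files, and the combined document is written to stdout.

import sys
import yaml

def get_by_path(data, query):
    # Same slash-separated navigation used elsewhere in this thread.
    node = data
    for part in query.split("/"):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node

def merge_list(basefile, query, otherfiles):
    with open(basefile) as f:
        base = yaml.safe_load(f)
    target = get_by_path(base, query)        # must resolve to a list in the base file
    for fname in otherfiles:
        with open(fname) as f:
            other = yaml.safe_load(f)
        target.extend(get_by_path(other, query))
    yaml.safe_dump(base, sys.stdout, default_flow_style=False, sort_keys=False)

# e.g. merge_list("131var.yaml", "cost function/observations/observers",
#                 ["233var.yaml", "188var.yaml"])
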
guoqing-noaa commented 3 days ago

@ShunLiu-NOAA @hu5970 @SamuelDegelia-NOAA @delippi For the RDASApp YAML file management, we may NOT need to make changes. We can still use the current gen_yaml_ctest.sh, but it would be good to commit the four combined last-step 'rrfs_mpasjedi*yaml' files into the repository so that other developers can use them as workable templates. And yaml-tools makes it easy to remove the other obs spaces so that one can work on just one, two, or a few observations.

guoqing-noaa commented 3 days ago

FYI, here is a simple bash script that checks every is_in field in the giant combined rrfs_mpasjedi_2024052700_Ens3Dvar.yaml:

#!/bin/bash

# Loop over the observers in the combined YAML (indices 0-15 for the 16 observers)
# and print each one's "is_in" settings.
for i in $(seq 0 15); do
  echo "====== $i"
  yquery rrfs_mpasjedi_2024052700_Ens3Dvar.yaml "cost function/observations/observers/$i" dump | grep "is_in"
done

guoqing-noaa commented 3 days ago

OK, I understand that some people are not comfortable with specifying the "query string". yaml-tools is a general tool and can be used with any YAML file.

But for JEDI-specific YAML files, we can make it much easier for users by wrapping some details in the bash scripts.

I have developed 5 bash scripts based on yaml-tools that are much easier to use, with almost no learning curve. Here is a glance at the usage:

cd yaml-tools/samples
mkdir tmp
cd tmp
jediYSplitObsr.sh ../rrfs_mpasjedi_2024052700_Ens3Dvar.yaml
jediYEmptyObsr.sh ../rrfs_mpasjedi_2024052700_Ens3Dvar.yaml
jediYOneObsr.sh ../rrfs_mpasjedi_2024052700_Ens3Dvar.yaml yamlobs.0.yaml  > ens3dvar.0.yaml
jediYOneObsr.sh ../rrfs_mpasjedi_2024052700_Ens3Dvar.yaml yamlobs.8.yaml  > ens3dvar.8.yaml
jediYMergeVarObsr.sh ens3dvar.0.yaml ens3dvar.8.yaml > ens3dvar.0+8.yaml

For details, please visit: https://github.com/RRFSx/yaml-tools/wiki/Tutorial%E2%80%90Dealing-with-JEDI-observations-in-YAML-files

delippi commented 3 days ago

@guoqing-noaa, I feel like this is more complicated than needed. The tools might be good to use in some instances, but really people should just be looking at the obs space yamls individually. They are short enough to handle. Don't even look at the large generated yaml--there's no reason to. You've made some bash scripts with "almost no learning curves", but that is exactly what gen_yaml.sh does, only simpler--I promise you that it is. You just comment out the files you don't want to cat into your super.yaml. The obs space yamls are written to be modular so that you can easily compare them side by side and reuse them for new obtypes, since most things carry over. We don't need yamls that work "out of the box" because it is simple enough to just cat them. Furthermore, this is all just temporary until we move to JCB to build our yamls.

guoqing-noaa commented 9 hours ago

@delippi

In NCO operations, almost everything is fixed and some aspects become much simpler. But research and development will have far more need for efficiently handling the final giant YAML files than you may anticipate. I see many such needs. It would take time to cover all of them, but here are a few examples beyond our normal data analysis tasks:

  1. Data assimilation (DA) performance monitoring: DA monitoring is a critical part of the RRFS system, and for each cycle we will get the basic DA configuration information directly from the run directory, using the final giant YAML files. We will NOT go back and dig through rrfs-test/validated_yamls/templates/obtype_config and exrrfs_da.sh to find out which observations were assimilated, what obs errors were set, what filters were used, etc. For archived RRFS runs, we will only archive the giant YAML files and will not archive the RDASApp hash used for that run, at least at the development stage.

  2. Verification: The verification team will use JEDI's UFO to compute O-B and then calculate all other key statistics. They will also get the giant YAML files directly from the RRFS run directories, and they can then use yaml-tools to easily filter out unnecessary obsSpaces while keeping those needed to verify the targeted observation types.

  3. Collaboration with research partners: We plan to strengthen our collaboration on MPASJEDI with NCAR/MMM, who are the primary maintainers and active developers of MPASJEDI. The most straightforward and effective way to share DA configurations with them is to exchange the final YAML files rather than the intermediate steps used to generate them. The same applies to collaborations among all MPASJEDI research partners (such as NCAR, CW3E, etc.).

  4. Comparisons among one's own DA experiments: I can imagine that in the next few years many of us will conduct DA experiments with different DA configurations, each possibly using a different version of RDASApp. The natural way to check which DA configurations were used across experiments is to use the giant YAML files, rather than tracking down and referencing each RDASApp hash.

Anyway, the final giant YAML file is the "gold standard" when we talk about DA configurations across different experiments, different research partners, different applications (verification, DA monitoring, etc.), and more. And yaml-tools will greatly facilitate navigating and manipulating these files.