esm-tools / esm_tools

Simple Infrastructure for Earth System Simulations
https://esm-tools.github.io/
GNU General Public License v2.0

Separation of information and processes, or the dilemma of postprocessing #31

Closed: JanStreffing closed this issue 3 years ago

JanStreffing commented 4 years ago

Hello everyone,

I thought I should explain a bit my thoughts on the postprocessing.

Let's step back a bit and look at old-style runscripts like the MPI-ESM ones from three years ago. Those were quite lengthy files that conflated information about the model, the computer and the experiment with programming of e.g. calendar functions, namelist modification, hostfile generation, the runtime loop etc. This was fairly inefficient, as users of all levels, including end users, were always confronted with some of the guts at the interface level. This led to slower progress in setting up experiments and ultimately did a disservice to research in earth system sciences.

The ksh version of the ESM-Tools fixed that by dissecting the runscripts along an information/process boundary. Processes, which were often the same across different models, were moved into the back-end and worked on mainly by core developers (Dirk and Paul), while pure information was placed into the front-end, where it became lists of information, not unlike an extra namelist or the yaml files we use now. These we called runscripts, and they were mainly what scientists/end users saw. In between there was a grey area, a place where the separation of information and programming was not complete: the model- and coupling-specific .function files. They contained primarily three types of content:

  1. default values for everything that was in the runscript-list
  2. runscript related processes that were not universal but model specific
  3. post-processing

and they were mainly worked on by model developers (Joakim, Martin B., me and others).

With the Python esm_tools we are now trying to complete the separation. The default values went into a yaml (I like that), and you try to include even one-off processes for model runscripts into the core functionality (more work for you, fine by me). However, for post-processing the list of potential candidate methods for inclusion is practically endless, so it's not practical to do that in the core. To keep with the theme of separating information and programming, you defined an interface with which model developers create their own post-processing command interface. Model developers would then use their own interface and place the post-processing commands in a list to chain them together.

To me it seems that this does not make life easier at the model developer level: the post-processing still has to be programmed anyway, but then it is additionally translated into the yaml interface at the command level and pieced back together again from the commands defined in the self-made interface. If you were to make this list tool-wide, not every command would have to be re-implemented for every model; however, you would end up with a very long and unwieldy list after a short while, because post-processing tasks are so diverse. So that too does not sound all that appealing to me.

My hunch is that keeping a small grey zone where info and programming are mixed would actually lead to faster and easier implementation of new post-processing features and models, thus making esm_tools a more productive tool for earth system sciences.

I'm open to any attempts to convert me over ;-) But currently I would prefer a shell script.

Cheers Jan

pgierz commented 4 years ago

...post-processing is still on my (never-shrinking) TODO list. I understand that mixing YAML/Python is likely a hurdle for most people wanting to do post-jobs, especially if we don't have a good example of how to do that, and even more so if the colleagues trying to do this aren't fluent in Python. Complete separation of programming logic and model information will be akin to Sisyphus and his boulder...

How about the following:

  1. We provide a clean interface for attaching scripts (regardless of language) into the job chain. This is already more-or-less implemented with Dirk's plugin functionality (which is in dire need of documentation!). Currently, it only consumes Python dictionaries.

  2. We could (quite easily, I guess; it's just a try/except block) provide an interface to alternatively list shell scripts which would be executed. This interface would simply check, in the following order, which criteria apply (a rough sketch follows after the list):

a. Did you list a Python function or method? Yes? Hooray. Do that. We expect config in as the argument, and config out as the return

b. Did you list a (shell) script? This would be checked via the filename extension (to keep the programming easier, it might be nice to define files that end in <lalala>.esm_extension...or something similar; we could also just try whether importing the file in Python works and otherwise assume it's something else). The script would be executed via subprocess.run and receive as arguments anything the YAML recipe gives as arguments/keyword arguments. This would necessitate changing the recipe to a list of dictionaries (for args/kwargs) rather than just a list of strings. Not sure exactly what to do with the config. Just take it in and spit it back out?

c. You did something weird that doesn't fit into the above two options. Abort.
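
A minimal sketch of how such a dispatch could look; the function name run_recipe_step, the .sh extension check, and the args handling are assumptions for illustration, not the actual esm_runscripts implementation:

import subprocess

def run_recipe_step(step, config, args=None):
    """Dispatch one recipe entry: a Python callable gets the config and must
    return it (case a); a shell script is run via subprocess.run (case b);
    anything else is rejected (case c)."""
    args = args or []
    if callable(step):
        # case a: config in as the argument, config out as the return value
        return step(config)
    if isinstance(step, str) and step.endswith(".sh"):
        # case b: hand the recipe arguments over to the script and pass the config through
        subprocess.run(["bash", step, *[str(a) for a in args]], check=True)
        return config
    # case c: something weird that fits neither option
    raise ValueError(f"Don't know how to run recipe step: {step!r}")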

One big downside here is that it opens up a huge gate for people to do whatever they want. Erase the entire experiment tree? Sure, why not. Overwrite every file with a cdo random? OK, you can do that... I doubt anyone will actually do that, but still....

For Jan: this would imply the following:

experiment_recipe:
    # ....default run stuff....
    - /path/to/my/script.sh:
        - args:
            - toot
            - tralala
    # ...other steps...

and in /path/to/my/script.sh:

#!/bin/bash
# Some default bootstrapping would go here to make sure you get all the variables from the YAML
arg1=$1 # Here is "toot"
arg2=$2 # Here is "tralala"

cdo info "$arg1"

joakimkjellsson commented 4 years ago

Hey guys, see my branch "foci2" for examples of this. Look at the end of configs/oifs/oifs.yaml. I took Dirk's plugin "preprocess" and made one called "postprocess". Then I've written oifs-preprocess.sh and oifs-postprocess.sh scripts and put them in configs/oifs/.

My thoughts on this are: 1) Jan is right. For ECHAM, postprocessing mostly means "cdo after", so that's easy. For OpenIFS it's a mix of NCO, CDO, GRIB_API and bash commands. Defining ways to do this in a yaml-based script sounds like a monumental task. Then we add some statistics stuff (like EP fluxes, overturning stream functions etc.). I'm usually in favour of the option that takes the fewest lines of code, and maybe bash scripts would win here?

2) There are now two plugins, "preprocess" and "postprocess", but they do exactly the same thing: launch a bash script. The only difference is that they do it at different stages of the run. Replacing the plugins with a functionality in ESM-Tools where we can launch a bash script at any time would be excellent. We should also make sure that the scripts "see" all the environment variables set by "mistral.yaml" or "ollie.yaml", i.e. that they find the correct CDO/ecCodes/NCO (a rough sketch follows after this list).

3) The way forward could be to split the current postprocessing scripts into smaller components. For OpenIFS: one script that concatenates the ICM files (only "cat" commands), one script that splits into one file per level type (only grib_copy), one script that converts to netCDF and subtracts the last step (only CDO), one script that remaps to a coarser grid (only CDO), and one script that computes daily means, monthly means etc. (only CDO). Then we could think about translating a few of these into the yaml-style methods.

4) The pre/postprocessing scripts should be their own git project! We should be able to share them with other users of OpenIFS, NEMO, FESOM who are not using ESM-Tools.
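
A minimal sketch of what such a "run a bash script at any time" helper could look like; the function name and the extra_env argument are hypothetical, not existing ESM-Tools code:

import os
import subprocess

def run_shell_step(script, args=(), extra_env=None):
    """Run a shell script with the caller's full environment plus any
    machine-specific settings (e.g. paths to CDO/ecCodes/NCO) that a
    machine file like mistral.yaml or ollie.yaml resolved."""
    env = os.environ.copy()
    env.update(extra_env or {})
    subprocess.run(["bash", str(script), *map(str, args)], env=env, check=True)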

Cheers Joakim

pgierz commented 4 years ago

Hi @joakimkjellsson and @JanStreffing, we have these general shell plugins now. I haven't used them in any of my runs, but do you think they could be a solution to this issue?

joakimkjellsson commented 4 years ago

Hi @pgierz, the plugins definitely do the trick, but they do have the drawback that the path to the plugins must be hardcoded in configs/esm_runscripts/esm_plugins.yaml. I would prefer it if there were native support in ESM-Tools to run a bash script without plugins. I can imagine a few other models might need this as well. For instance, Sebastian has a nice bash script to compress NEMO output. That would be on my wish list, but I have no idea how hard/possible that is to do...

Cheers Joakim

pgierz commented 4 years ago

I have a way out of the hardcoding problem: you can add an "entry point" and then use pip install to make your plugin available. If you have a look at the fesom-mesh prep plugin in the plugin group, you can see what I mean. I still need to discuss that with @dbarbi once he's back from vacation.
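
For illustration, a minimal sketch of how a plugin package could register such an entry point with setuptools; the group name "esm_tools.plugins" and the module/function names are assumptions, not necessarily the group esm_tools actually uses:

from setuptools import setup

setup(
    name="my_esm_plugin",
    version="0.1.0",
    py_modules=["my_esm_plugin"],
    entry_points={
        # hypothetical entry point group; esm_tools would look plugins up under its own group name
        "esm_tools.plugins": [
            "compress_nemo = my_esm_plugin:compress_nemo",
        ],
    },
)

After a pip install of such a package, the plugin is discoverable through the entry point, so no path has to be hardcoded in esm_plugins.yaml.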

In the end "everything" is a plugin; we just distinguish between "core" and "user"-built ones. The basic scheme is always to pass the config around. On my wish list is to define the recipe for how to run the model individually for each setup; that would make everything a bit easier to handle. Right now, I need to comment out your preprocess step to be able to run awiesm. Not ideal...
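
As a tiny illustration of that scheme (the function and key names below are made up), a user-built plugin is then just a function that takes the config and hands it back:

def compress_nemo_output(config):
    """Hypothetical user plugin: read what it needs from the config dictionary,
    do its work (e.g. launch a compression tool on the NEMO output), and
    return the (possibly modified) config so the next recipe step can use it."""
    config.setdefault("general", {})["nemo_output_compressed"] = True
    return config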

Overall though, I think this is moving in the right direction! :-)

dbarbi commented 3 years ago

As this is a big topic, we have applied for a grant from the Helmholtz Metadata Collaboration to address it.

pgierz commented 3 years ago

Closing this: post-processing can be integrated into Dirk's new workflow manager, which will be available for everyone in release 6.