ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0
221 stars 128 forks source link

Application of the broken recipe policy for releases #3065

Closed remi-kazeroni closed 1 year ago

remi-kazeroni commented 1 year ago

Is your feature request related to a problem? Please describe.

Since a few months, we have an official policy for broken recipes in our docs: https://docs.esmvaltool.org/en/latest/community/broken_recipe_policy.html. This states: affected recipe will be listed under “broken recipes” in the Changelog, together with the version number of the last known release in which the recipe was still working. If a recipe continues to be broken for three releases of the ESMValTool (about one year) and no recipe maintainer could be found during this time, the affected recipe and diagnostics will be retired.

I'd like to discuss a bit here how to apply our policy in practice. We have a few cases where the recipes are known to be broken, issues have been opened with the next milestone, maintenance has not happened for various reasons and issues are bumped to the next milestone. This makes the job of the release manager harder as the person runs these recipes without knowing these are broken, needs to figure out the problem again, scroll through issues and comments to check if that was known before, ...

For v2.8, I'm planning to add a first "broken recipes" section to the Changelog and list broken recipes. This might not be so readable in the long run if lists of broken recipes are split across changelog for different releases.

We discussed yesterday in a Tech Lead Team Meeting that it could be useful to have a full list/table of broken recipes in our documentation. This would help users when they try to run broken recipes and are not sure if they did something wrong. This would also simplifies the release process a bit and hopefully increase the willingness to maintain broken recipes. The table could be linked at the bottom of the recipe index: https://docs.esmvaltool.org/en/latest/recipes/index.html The table could look like:

Broken recipe Last known working version Problem GitHub issue
recipe_perfmetrics_land_CMIP5.yml v2.? ESMValCore issue https://github.com/ESMValGroup/ESMValTool/issues/2594
recipe_check_obs.yml v2.4? ESMValCore issue https://github.com/ESMValGroup/ESMValCore/issues/1388

Possible problems would be:

If the table gets too long, we could think of having one table per type of problems.

Also, it could be good practice to open an issue listing newly broken recipes once a release is done. By using milestones, we can check whether these recipes have received attention before reaching the limit of 3 releases. An example: https://github.com/ESMValGroup/ESMValTool/issues/2943

Would you be able to help out? Would you have the time and skills to implement the solution yourself?

Yes, I could add this table to the documentation and update the release instructions accordingly. But I'd be happy to get feedback from @ESMValGroup/esmvaltool-coreteam first 🍻

bettina-gier commented 1 year ago

I like this a lot Remi! One note would be to perhaps add another column of partially working or I don't know how to word it, seeing esmvalcore as the problem for both entries in the table sounds to most users as an unfixable issue, when reading the github issues reveals the problem in recipe_perfmetrics_land_CMIP5.yml to be in the handling of a single dataset which users might be fine with excluding for their purposes. Similarly, seems that only parts of recipe_check_obs.yml are affected by the issue of overwriting cmor table info when deriving custom variables. To me it seems these cases for users are easier to deal with if their focus wasn't on the broken parts, than when the problem lies in diagnostics not running leading to no output for everything.

On a side note if we change the file for the recipe index I'd be strongly in favour of adding the recipe names after the titles in brackets, as I know that at least I sometimes can't remember the titles but only recipe filenames making it harder to look them up =D

schlunma commented 1 year ago

Thanks @remi-kazeroni, these are really good suggestions. This will make the life of the release manager so much easier, and also helps users in choosing which recipes to run.

Two suggestions:

remi-kazeroni commented 1 year ago

I have created a link that is set to point to the overview table of successful/failed recipe runs for the last release. That could also be used to inform our users which recipes can't be run with the stable release of ESMValTool: https://esmvaltool.dkrz.de/shared/esmvaltool/stable_release/debug.html

alistairsellar commented 1 year ago

Thanks @remi-kazeroni it sounds like a good approach to me. To confirm, I am OK with recipe_autoassess_landsurface_soilmoisture being marked as broken under this policy, and accept the badge of shame that comes with it. :sad:. Let me know if I need to do anything associated with this, like updating the doc page.

I like the suggestion of adding relevant statement to doc page. In this case I could make it clear that the recipe only works on Jasmin, and that there is a plan to fix it for other machines.

alistairsellar commented 1 year ago

Sorry, some weird side effect of copy-and-paste there - was not intending to close the issue! Was trying to reply constructively!

remi-kazeroni commented 1 year ago

Thanks @remi-kazeroni it sounds like a good approach to me. To confirm, I am OK with recipe_autoassess_landsurface_soilmoisture being marked as broken under this policy, and accept the badge of shame that comes with it. :sad:. Let me know if I need to do anything associated with this, like updating the doc page.

I like the suggestion of adding relevant statement to doc page. In this case I could make it clear that the recipe only works on Jasmin, and that there is a plan to fix it for other machines.

Thanks a lot for your opinion on this @alistairsellar. I agree it would probably make sense to flag this recipe as broken with the explanation that it only runs on Jasmin at the moment. Hopefully, that could be improved in the future.

Would you know why the recipe_autoassess_landsurface_permafrost.yml runs fine (see test) even if the same climatology files seem to be required? Could it be that the climatology files are actually not used?

In both cases the recipes already mention that the files can only be accessed on Jasmin but the doc pages (here and here) could probably be clarified. It would be great if you could take a look.

remi-kazeroni commented 1 year ago

After the first round of recipe testing for the release v2.8 (see #3076), it seems that we have only 2 families of recipes that can't be run due to not publicly available data: recipe_autoassess_landsurface_*.yml and recipe_schlund20jgr_gpp*. What is important from a release point of view is that these 2 families of recipes do not get testing again and again if that implies extra work for each release manager to find the missing data, contact the maintainers of these recipes, investigate the same issues during each round of recipe testing.

To handle these recipes, I would propose one of the two following approaches:

  1. flag the corresponding recipes as broken. This could give some incentive to fix the recipes (e.g. finding a solution to make all necessary files publicly available). This seems to be the approach favoured for recipe_autoassess_landsurface_*.yml (see this https://github.com/ESMValGroup/ESMValTool/issues/3065#issuecomment-1460526522)
  2. remove these recipes from the list of recipes to be tested for releases. This can be easily done in the scripts that we've started to use to run all recipes at once (see #2883). But in such cases, it should be properly documented in the online docs and in the recipe which files are missing and who should be contacted for questions (e.g. the maintainer). It would be also the responsibility of the maintainer to test their recipes with release candidate for the Core and the release manager does not have to do it any more. This seems to be the approach favoured for recipe_schlund20jgr_gpp* (see this https://github.com/ESMValGroup/ESMValTool/issues/3076#issuecomment-1460244896).

I think we need to be pragmatic here. The goal is to make the release process as simple as possible. The discussion on how to deal with "missing data recipes" should be handled on a case by case basis between the maintainers, the Science and the Tech Lead Teams.

valeriupredoi commented 1 year ago

Thanks @remi-kazeroni it sounds like a good approach to me. To confirm, I am OK with recipe_autoassess_landsurface_soilmoisture being marked as broken under this policy, and accept the badge of shame that comes with it. :sad:. Let me know if I need to do anything associated with this, like updating the doc page.

I like the suggestion of adding relevant statement to doc page. In this case I could make it clear that the recipe only works on Jasmin, and that there is a plan to fix it for other machines.

@alistairsellar @remi-kazeroni let's mark that as broken, ok. But for the love of motor racing gods, let's get that data be public before 2.9 - am looking at you, Met Office :grin:

alistairsellar commented 1 year ago

Would you know why the recipe_autoassess_landsurface_permafrost.yml runs fine (see test) even if the same climatology files seem to be required? Could it be that the climatology files are actually not used?

Yes, you are right. It turns out that it is not actually dependent on Jasmin - only the soilmoisture one is.

In both cases the recipes already mention that the files can only be accessed on Jasmin but the doc pages (here and here) could probably be clarified. It would be great if you could take a look.

Yes, they do need clarified: to emphasise the dependency in one case, and to remove the implication of a dependency in the other. I've opened #3103 for this. Soilmoisture documentation now mentions (in #3103) that the recipe is broken, as part of highlighting Jasmin dependency. But for application of broken recipe policy, should the word "broken" also be in bold or something at the top of the documentation page? Suggestions welcome (or feel free to edit branch directly).

bouweandela commented 1 year ago

I think we need to be pragmatic here. The goal is to make the release process as simple as possible. The discussion on how to deal with "missing data recipes" should be handled on a case by case basis between the maintainers, the Science and the Tech Lead Teams.

It would be good to talk about this again at the next workshop and see if we can update our policy to capture the ideas people have on this.

bouweandela commented 1 year ago

add another column of partially working or I don't know how to word it, seeing esmvalcore as the problem for both entries in the table sounds to most users as an unfixable issue, when reading the github issues reveals the problem in recipe_perfmetrics_land_CMIP5.yml to be in the handling of a single dataset which users might be fine with excluding for their purposes.

If a recipe is failing because of a single dataset, we could just remove it so the recipe works again.