Issue opened by @valeriupredoi (since closed):

Unfortunately I must consider reverting to the older multimodel module asap - the new module is EXTREMELY slow, buggy in different spots, and not that much more efficient for memory usage. Take this recipe for instance:

I propose we retire the current implementation to a development branch and temporarily re-instate the 2.2.0 module.
First, let's try to stay factual and polite. SHOUTING is not generally considered that.
On the points:
I think it is (almost) meaningless to look at runtime and, for that matter, memory before the two approaches are successfully doing the same thing.
For the different heights, I like the new approach more than the old one. Silently ignoring a coordinate that is different doesn't seem correct to me. Perhaps a tas fix for ACCESS with a warning output could be a good solution? The CMIP5 data request unfortunately is a bit soft on this:

> normally, the temperature should be reported at the 2 meter height
The last point is interesting. Why do the cubes end up with different lengths? This smells like a bug and should be treated as such.
@zklaus: The different length of the time axis seems to be connected to leap years. I just wrote in the other issue: the reason for the warning

```
WARNING [32819] /work/bd0854/b380216/anaconda3/envs/esmvaltool202106/lib/python3.9/site-packages/iris/coords.py:1979: UserWarning: Collapsing a non-contiguous coordinate. Metadata may not be fully descriptive for 'time'.
  warnings.warn(msg.format(self.name()))
```

and the error

```
IndexError: index 9490 is out of bounds for axis 0 with size 9490
```

is most probably that there is a different number of days in the 26 years the recipe tries to look at: either 9490 or 9497 days (probably depending on whether the calendar contains leap years). I'll try whether `regrid_time` would help.
```
(esmvaltool202006) b380216@mistralpp3% grep -in Running.extract_region.air_temperature.*time: /work/bd1083/b380216/output/recipe_pv_capacity_factor_20210629_095915/run/main_log_debug.txt
6449:2021-06-29 10:00:43,142 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9490; latitude: 241; longitude: 480)
6731:2021-06-29 10:01:01,890 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9497; latitude: 241; longitude: 480)
7020:2021-06-29 10:01:06,680 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9497; latitude: 241; longitude: 480)
7593:2021-06-29 10:01:32,682 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9497; latitude: 241; longitude: 480)
8138:2021-06-29 10:02:06,893 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9497; latitude: 241; longitude: 480)
8733:2021-06-29 10:02:30,240 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9490; latitude: 241; longitude: 480)
9052:2021-06-29 10:02:45,243 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9497; latitude: 241; longitude: 480)
9736:2021-06-29 10:03:08,329 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9490; latitude: 241; longitude: 480)
10297:2021-06-29 10:03:46,899 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9490; latitude: 241; longitude: 480)
11117:2021-06-29 10:04:26,199 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9497; latitude: 241; longitude: 480)
11743:2021-06-29 10:04:47,619 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9490; latitude: 241; longitude: 480)
12691:2021-06-29 10:05:41,717 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9497; latitude: 241; longitude: 480)
13764:2021-06-29 10:06:30,602 UTC [9314] DEBUG esmvalcore.preprocessor:279 Running extract_region(air_temperature / (K) (time: 9490; latitude: 241; longitude: 480)
```
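A quick arithmetic check (an illustrative sketch; it assumes the 26-year daily period mentioned above) matches the two lengths seen in the log:

```python
# Two plausible lengths for 26 years of daily data:
years = 26
print(years * 365)      # 9490 -> a calendar without leap days (e.g. noleap)
print(years * 365 + 7)  # 9497 -> a Gregorian calendar with 7 leap days in the period
```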
yes @katjaweigel that's exactly that, well spotted! About @zklaus's point about run times: I would agree in principle if it were a run time difference of tens of percent, but a factor of 4 or 5 is unacceptable. I am aware of this happening in certain situations; I did quite a few tests prior to this being merged and noticed it consistently, albeit those tests were done with significantly lower amounts of data. This is a solid case with plenty of data, and unfortunately this runtime difference is a showstopper in my books, especially since the gain in required memory is relatively marginal :+1:
> For the different heights, I like the new approach more than the old one. Silently ignoring a coordinate that is different doesn't seem correct to me. Perhaps a tas fix for ACCESS with a warning output could be a good solution?
Agreed! But we need a checker inside multimodel to either ignore or fix that AuxCoord (or others that might differ). Failing on a coordinate that is not used in the statistic itself is not the desired behaviour.
On the runtime: @valeriupredoi, your opinion is noted and will of course be taken into consideration.
On the AuxCoord: This coordinate is used. What is the correct height of the resulting data? 2.0m? 1.5m? The weighted average? I don't see that the multimodel preprocessor is in a good position to make that call. Since the overwhelming majority of models do use 2m, and this is the suggested height of the standards, the best approach to me seems to be adding a fix to ACCESS-1.3 to change the height coordinate in the `tas` variable(s). One might consider a correction to the data itself, but I leave this to the atmospheric experts.
On the leap day issue: How did the old preprocessor handle this? The only clear solution seems to be some form of time regridding, but I think that should be done by the user explicitly; thus raising an error (though possibly with a better message) is the right thing, imho.
Currently, the preprocessor setting

```yaml
regrid_time:
  frequency: day
```

doesn't solve the leap year issue; it fails with

```
IndexError: list index out of range
2021-06-29 12:27:34,816 UTC [8169] INFO esmvalcore._main:444
```

and there are still cubes with 9490 and 9497 days.
> not that much more efficient for memory usage.

Note that work on the performance is still in progress in #968; no performance improvement should be expected from the current implementation. The improvement that is included in the `main` branch at the moment is that it uses iris instead of numpy for computing the statistics. As expected, this will lead to stricter handling of metadata. We will probably need some additional fixes for that, e.g. I proposed #1177 to address some of these issues.
The old preprocessor would find the overlap via `np.intersect1d` and would chop the data around it (if `overlap` was the setting); I think this new one does it in a similar manner but looks only at the start/end points, without checking that the points inside the overlap interval are the same (or that the time point arrays have equal lengths).
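A minimal sketch of that old-style overlap logic (illustrative only, with made-up time arrays; this is not the actual ESMValCore code):

```python
import numpy as np

# Two daily time axes of unequal length, e.g. from different calendars.
time_a = np.arange(9497)  # e.g. Gregorian calendar, 7 leap days
time_b = np.arange(9490)  # e.g. noleap calendar

# Find the time points common to both datasets...
overlap = np.intersect1d(time_a, time_b)

# ...and chop each dataset to the overlap before computing statistics.
data_a = np.random.rand(time_a.size)
data_a_overlap = data_a[np.isin(time_a, overlap)]
```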
If we're talking issues with the new mmm handling that weren't there before the refactoring: I also found one testing recipe_perfmetrics_land_CMIP5.yml, as mentioned in this comment: https://github.com/ESMValGroup/ESMValTool/issues/2198#issuecomment-870450689. After regridding, not all models had bounds for the longitude coordinate, making iris fail the cube merge. If I add a guess_bounds in there, the problem is solved. Seems to me that for now, with all these issues arising, it would be better to revert the change until all mmm recipes are tested and potential issues compiled and looked at. Now, this might highlight some things we need to change in other areas of the core - but considering the release timing, a revert is the most plausible thing.
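For reference, that workaround amounts to something like this sketch (assuming an iris cube whose horizontal coordinates lack bounds after regridding):

```python
def ensure_horizontal_bounds(cube):
    """Add guessed bounds to longitude/latitude if they are missing (sketch)."""
    for name in ("longitude", "latitude"):
        coord = cube.coord(name)
        if not coord.has_bounds():
            coord.guess_bounds()
```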
> Note that work on the performance is still in progress in #968; no performance improvement should be expected from the current implementation. The improvement that is included in the `main` branch at the moment is that it uses iris instead of numpy for computing the statistics.
right on, but how can you explain the slowness?
> Seems to me that for now, with all these issues arising, it would be better to revert the change until all mmm recipes are tested and potential issues compiled and looked at.
good call @bettina-gier :+1:
Note that all the problems we have discussed so far (including the one brought up by @bettina-gier) are pre-existing bugs that have gone unnoticed until now. Reverting would therefore not solve any issue but just cover up wrong results. I am at the moment not convinced that is the right way to go about it.
Let's try to focus our energy now on understanding the problems and move to a solution strategy after that.
Klaus, those are not masked problems but rather things we didn't care enough about to fix in the old module, and that we shouldn't let cause failures in the new module. We have an MM module that is broken in several respects, and the time needed to address those aspects is much longer than the time to revert and have a working module. We care about functionality foremost, and only then about improving performance and fixing knickknacks - and it's not as if the performance of the new MM is better; as Bouwe says, it's still work in progress.
V, I disagree with that assessment and don't think you can speak for all of us here.
As a case in point, trying to advance the issue of the uneven time axis: MOHC's HadGEM2-ES uses a 360-day calendar. Over the 25-year period of this recipe, that's a difference of 25×5 + 25/4 ≈ 131 days, or over 4 months of data that get chopped off the end of all models when you add HadGEM to the list of datasets. Worse, since in all calendars the time is kept as days since a reference point, the same time point (as an integer number of days since the reference) refers to different days of the year - towards the end of the period in completely different seasons - so these all get mixed up here. Whether summer days get mixed into winter or winter days into summer depends on the position of HadGEM in the list of datasets.

I think scientists care about this sort of thing.
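To make the mixing concrete, a minimal sketch with cftime (the offset and dates are illustrative, not taken from the recipe):

```python
import cftime

# The same integer offset lands on different dates in different calendars.
units = "days since 1980-01-01"
for calendar in ("standard", "360_day"):
    print(calendar, cftime.num2date(9000, units, calendar=calendar))
# standard: 2004-08-22; 360_day: 2005-01-01 -- more than four months apart.
```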
I don't speak for the others - I speak for the core functionality of a key module in ESMValTool, which is currently busted and needs serious repairs to get both its functionality and performance up to standard. The interim solution is to replace it with the old module, which is known to work (and has been tested against reference results), while we perform the needed fixes and performance optimizations; it's as simple as that.
I gave a concrete example of where the old module seems to lead to very wrong results. Do you think that example is wrong? Do you think people are aware of this?
"Seems" is the key word - the old module has been tested thoroughly against reference results, unlike the new one. Have a go using the revert branch when you have some time, and if that is indeed a bug, then we list it here and weigh the arguments for or against reverting.
OK @zklaus @katjaweigel @bettina-gier I have got the revert done in https://github.com/ESMValGroup/ESMValCore/pull/1202 (just need to fix the docs build, but that'll be done if we decide to revert) - if you guys want to test or just run your recipes to actually get results, go for it! I think we need to talk about this asap; luckily we have the ESMValTool call on Thu. @bouweandela sign me up for this pls :grin:
Sorry to hear that you're experiencing issues with the updated multi-model code (but good to see some real use case testing coming in!) My intention has always been for the code to be more transparent, so that it will be easier to understand, debug, and optimise. While I understand it's annoying that this recipe doesn't work, let's not rush into reverting the update.
I agree with @zklaus that we should prefer solving these bugs that are surfacing right now properly, rather than hiding them away. In the old branch, there were no tests, notes, docstrings, etc. that would inform fellow developers about these "intentional workarounds". In updating the mmstats code over the past year or so, we stumbled upon many of these implicit assumptions (e.g. also in #685), and these things keep surfacing all the time, which is just as annoying, only not all at once.
With respect to the specific issues mentioned here:
- On the differing coordinates: xarray's `concat` offers a `join` argument that controls the behaviour. See http://xarray.pydata.org/en/stable/generated/xarray.concat.html. E.g. it could fill the cube with NaNs if a specific dataset doesn't have a certain coordinate value.
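A minimal sketch of that xarray behaviour (illustrative values; `height` stands in for the differing scalar coordinate):

```python
import xarray as xr

a = xr.DataArray([285.0], coords={"height": [2.0]}, dims="height")
b = xr.DataArray([284.5], coords={"height": [1.5]}, dims="height")

# join="outer" keeps the union of coordinate values and fills gaps with NaN.
combined = xr.concat([a, b], dim="model", join="outer")
print(combined)  # shape (model: 2, height: 2), NaN where a dataset has no value
```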
@Peter9192 thanks for the summary and the links to the related issues! So far, I did not find any recipe that uses mip = day and multi-model statistics (recipe_pv_capacity_factor.yml is not merged yet and the multi-model mean was removed from the PR), but one never knows what happens in non-merged recipes. I think the leap year issue could be solved through the preprocessor setting `regrid_time`? But currently that doesn't work (at least not with `regrid_time: frequency: day`), see above. About all these issues I just wonder:
- Why was this not marked as a possible breaking change to the development team (as for example for this PR: https://github.com/ESMValGroup/ESMValCore/pull/1099)?
- Why not test all recipes with multi-model mean before the core release, if there is such a large change?
> I think the leap year issue could be solved through the preprocessor setting `regrid_time`?
I think so too. Also relevant: https://github.com/ESMValGroup/ESMValCore/issues/744 (and see #781 for an overview of many related developments)
> Why was this not marked as a possible breaking change to the development team
Not sure what you mean by "marking" here. I see no label or anything similar on the PR you linked. I think we've made a lot of progress in improving the development process, but this could've been smoother, I agree.
> Why not test all recipes with multi-model mean before the core release, if there is such a large change?
I think there are several reasons for that. For one, there's no easy way to do that. That's why we've spent a lot of time improving the tests and the test infrastructure (e.g. the testbot, the regression tests with sample data, and #1023). Bouwe is currently trying to set up infrastructure to run all recipes on Mistral, but this is still challenging. And for previous releases we actually did a lot of manual tests with recipes, so I would have expected issues to surface during this release as well. But of course in this case it depends on the effort everyone puts into testing this with their frequently used recipes.
@Peter9192 Thanks for your answer! With "marking" I meant just writing "@ESMValGroup/esmvaltool-developmentteam" somewhere in the discussion, as for the issue I mentioned above. Then everybody in the development team gets an email and is aware of the discussion. I think the idea communicated during the last ESMValTool workshop was that this should be done for potential "breaking changes", i.e. changes in the core which could lead to recipes that don't run any more.
cheers @Peter9192 :beer: Here's how I see it: if we decide to use the older v2.2.0 MM module, we should have a quick and dirty v2.3.1 out asap, followed by the bugfix release v2.3.2 once the new MM (and the fx stuff) have been merged.
I like the idea of going back to the old version and waiting until the "shakedown" is finished. A bugfix release once we are happy with the new version sounds like the way to go.
Would it be an idea to have both functions side by side? I.e. users can call `multimodel_statistics` for the old implementation (I recommend a very annoying deprecation warning!) and `multimodel_statistics2` for the new implementation. Then you get the benefit of not breaking the old recipes, giving users some time to migrate, while also having the new implementation available in the release for the shakedown. Once the new implementation has matured, the old version can be removed.
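A sketch of what the side-by-side proposal could look like (the names, signatures, and warning text are hypothetical, not the actual ESMValCore API):

```python
import warnings


def multimodel_statistics(products, statistics, span="overlap"):
    """Old implementation, kept temporarily for existing recipes (hypothetical)."""
    warnings.warn(
        "multimodel_statistics is deprecated; please migrate to "
        "multimodel_statistics2",
        DeprecationWarning,
        stacklevel=2,
    )
    ...  # old numpy-based computation


def multimodel_statistics2(products, statistics, span="overlap"):
    """New iris-based implementation (hypothetical)."""
    ...  # new iris-based computation
```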
Here is how I see it:
The recipes should surely be fixed, no matter what we decide about the multimodel preprocessor.
> Would it be an idea to have both functions side by side? I.e. users can call `multimodel_statistics` for the old implementation (I recommend a very annoying deprecation warning!) and `multimodel_statistics2` for the new implementation. Then you get the benefit of not breaking the old recipes, giving users some time to migrate, while also having the new implementation available in the release for the shakedown. Once the new implementation has matured, the old version can be removed.
that's the right way from a software-cycle point of view, absolutely, and in fact it'd be easy since we don't have to change anything in the Tool - it'd also allow us to do a direct apples-to-apples comparison between the two modules. I like the idea :+1:
@zklaus To "fix the recipes" one probably needs to do additional fixes in the core in most cases:

- As far as I understood the bounds issue from @bettina-gier, this should be fixed in the core.
- The only way to "fix" the leap year issue at the moment is not to use the multi-model mean for daily data (which is what was done for the recipe where I encountered it). To really fix it, we would probably need a completely new version of `regrid_time` (a preprocessor, so it belongs to the core as well). If I understood correctly, `regrid_time` currently only resamples the time, keeping the number of points. If we wanted to really regrid daily data onto the same grid, we would need possibilities to interpolate, as for the horizontal regridding.
- For the issue with the 2m or 1.5m height a model fix was suggested (which would be part of the core). But I'm actually not sure that would be a better solution, because:
  - A model fix would change the height to 2m even if no multi-model mean is applied.
  - It is difficult to calculate the right tas for 2m (if you use a climatology/fixed value, the error will probably depend on latitude, land/sea, "weather", .... I'm not sure how well you could approximate it with the help of other variables: I guess the height grid for ta is too coarse to help; rather hfls and/or ts could be used, but I'm not sure how well this would work, or whether it is a good idea to use additional variables for fixes, since it would turn tas (2m) for such models into a kind of derived variable.)
  - The difference between tas (2m) and tas (1.5m) is probably tiny in most cases.
  - Not using tas for a multi-model mean from models where it is not given at 2m is probably not a popular idea.
  - Therefore it would maybe be better to somehow allow averaging 1.5m and 2m tas in case of multi-model means and give out a warning? (On the other hand, nobody reads warnings and there might be other similar but larger differences.)
There certainly may be changes necessary in the core, and a bugfix release is a real possibility. We may even decide to revert the multi-model statistics if weighing the issues and solutions makes that the best choice. But we should be clear about where the problems lie, so as to address them in the right place.
> As far as I understood the bounds issue from @bettina-gier, this should be fixed in the core.
Yes. Longitudes and latitudes must have bounds. I am not sure whether this is a simple dataset issue that needs a fix or a bug in the regridding. That should be discussed in the appropriate issue.
> The only way to "fix" the leap year issue at the moment is not to use the multi-model mean for daily data (which is what was done for the recipe where I encountered it). To really fix it, we would probably need a completely new version of `regrid_time` (a preprocessor, so it belongs to the core as well). If I understood correctly, `regrid_time` currently only resamples the time, keeping the number of points. If we wanted to really regrid daily data onto the same grid, we would need possibilities to interpolate, as for the horizontal regridding.
I think the issue is with different calendars, so at the moment one needs to avoid using multi-model statistics for daily data with mixed calendars. That seems quite sensible to me. A long-term solution probably involves conversions between calendars. This is commonly done by duplicating or removing days, but the details should be discussed elsewhere. I would probably rather place such a thing in a (very useful!) `align_calendars` or similar preprocessor.
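For instance, the remove-days convention could look like this sketch (assuming `dates` is a sequence of datetime-like objects and `values` a matching numpy array):

```python
import numpy as np


def drop_leap_days(dates, values):
    """Convert daily data to a noleap calendar by removing Feb 29 (sketch)."""
    keep = np.array([not (d.month == 2 and d.day == 29) for d in dates])
    return [d for d, k in zip(dates, keep) if k], values[keep]
```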
> For the issue with the 2m or 1.5m height a model fix was suggested (which would be part of the core). But I'm actually not sure that would be a better solution, because:
>
> - A model fix would change the height to 2m even if no multi-model mean is applied.
> - It is difficult to calculate the right tas for 2m (if you use a climatology/fixed value, the error will probably depend on latitude, land/sea, "weather", .... I'm not sure how well you could approximate it with the help of other variables: I guess the height grid for ta is too coarse to help; rather hfls and/or ts could be used, but I'm not sure how well this would work, or whether it is a good idea to use additional variables for fixes, since it would turn tas (2m) for such models into a kind of derived variable.)
> - The difference between tas (2m) and tas (1.5m) is probably tiny in most cases.
You are completely right about these points.
> Not using tas for a multi-model mean from models where it is not given at 2m is probably not a popular idea.
I'm not sure how much backlash there really would be. If it's only about a few CMIP5 datasets (that have a history of issues with their surface temperature), perhaps not many people care. Though this may be more widely spread.
> Therefore it would maybe be better to somehow allow averaging 1.5m and 2m tas in case of multi-model means and give out a warning? (On the other hand, nobody reads warnings and there might be other similar but larger differences.)
Perhaps. I am really not sure either way. But should that also be done for the other offered statistics? What about standard deviation or extreme percentiles?
Either way, again, let's discuss this in the right issue. If we decide that averaging 1.5m and 2m tas is the way to go, let's state it that way and make clear exactly what behaviour we want.
> For the issue with the 2m or 1.5m height a model fix was suggested (which would be part of the core). But I'm actually not sure that would be a better solution, because:
>
> - A model fix would change the height to 2m even if no multi-model mean is applied.
> - It is difficult to calculate the right tas for 2m (if you use a climatology/fixed value, the error will probably depend on latitude, land/sea, "weather", .... I'm not sure how well you could approximate it with the help of other variables: I guess the height grid for ta is too coarse to help; rather hfls and/or ts could be used, but I'm not sure how well this would work, or whether it is a good idea to use additional variables for fixes, since it would turn tas (2m) for such models into a kind of derived variable.)
> - The difference between tas (2m) and tas (1.5m) is probably tiny in most cases.
> - Not using tas for a multi-model mean from models where it is not given at 2m is probably not a popular idea.
> - Therefore it would maybe be better to somehow allow averaging 1.5m and 2m tas in case of multi-model means and give out a warning? (On the other hand, nobody reads warnings and there might be other similar but larger differences.)
I totally agree with Katja on the tas height issue. Actually, tas is defined as "near-surface (usually, 2 meter) air temperature". Some models give it at 1.5m (definitely ACCESS, and I guess anything else using the UM for the atmosphere, so UKESM, HadGEM, etc.), so it's quite a few models from the same family. Ignoring all of them in a multi-model mean because they define tas at a different height is not the way to go! This is not an error in the data. tas is a diagnostic anyway, not a prognostic variable; if you need the prognostic variable and care about the difference between 1.5 and 2 meters, you need to use ta. So I would allow averaging the 1.5m and 2m heights (simply ignoring the height in this case would be my suggestion).
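"Simply ignore the height" could amount to something like the following sketch (assuming `cubes` is the list of iris cubes entering the multi-model step):

```python
from iris.exceptions import CoordinateNotFoundError


def drop_height(cubes):
    """Remove the scalar height coordinate before combining cubes (sketch)."""
    for cube in cubes:
        try:
            cube.remove_coord("height")
        except CoordinateNotFoundError:
            pass
```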
@zklaus I'm not blaming the changes in the core for the issues themselves. I'm just not happy that there were not enough tests before the release to discover them, and that there was no warning about possible "breaking changes". It would be good if everything in the tool were working when it is released, and at the moment that is not the case with the new core. So I wonder how that could best be solved? I don't think it is good if we have to reduce/change the models used in the recipes for the new release (because of bounds, 2m height, or new model fixes which fix the grid for one variable but make a derived variable impossible, as happened for one model in the case of recipe_deangelis15nat.yml), but right now that would be the only way to make the tool work with the released core by changing the tool alone. (And I'm not even sure it would help in all cases.) Some of these issues can always happen (the leap year issue currently probably only affects a non-merged recipe in a PR; one can never avoid something like that completely), but for this release there are rather many of them, and we probably would have seen several of them with more tests before the core release.
@ruthlorenz, @katjaweigel, it's good to know that more models are affected (and I confirmed this for both CMIP5 and CMIP6). I opened a new issue to discuss specifically the question on how to treat this. It would be great if you could help to resolve the questions over there in #1204.
@katjaweigel, on tests vs releases, first, keep in mind that the current released version of the tool, 2.2.0, requires a core version <2.3.0, so there is no released version of the tool that has these problems. Whether a change is breaking or not can be difficult to determine until it breaks something, more so if the actual behavior is correct.
I agree that we should try to keep all models in the recipes, and even though that is sometimes not possible (for example, Manuel Schlund ran into a problem with a dataset that has meanwhile been retracted), just removing models because we can't deal with them anymore is certainly not the way to go.
I think all issues raised here are now being addressed in other issues. An overview can be found in https://github.com/ESMValGroup/ESMValTool/discussions/2218. Closing this. If you disagree, please feel free to reopen with a comment explaining the need.