ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

Recipe testing and output comparison for release 2.9.0 #3239

Closed bouweandela closed 1 year ago

bouweandela commented 1 year ago

This issue documents recipe testing for the v2.9.0 release

Installation and environment

$ git clone git@github.com:ESMValGroup/ESMValTool
$ cd ~/ESMValTool
$ git checkout v2.9.x
$ mamba env create -n esmvaltool-v2.9.x-2023-06-20 -f environment.yml
$ conda activate esmvaltool-v2.9.x-2023-06-20
$ pip install .
$ cd ..
$ git clone git@github.com:scitools/iris
$ cd iris
$ git checkout v3.6.x
$ pip install .

Environment file

environment.txt

Config user file

All options at the default for Levante except search_esgf: when_missing.
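For reference, that single non-default setting is one key in `config-user.yml`; a minimal sketch (all other options left at the Levante defaults, surrounding keys omitted):

```yaml
# config-user.yml (sketch): the only deviation from the Levante defaults
search_esgf: when_missing
```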

bouweandela commented 1 year ago

Overview of the results

Webpage: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.9.0rc1/debug.html

Successful: 135 out of 153

Recipes that failed with DiagnosticError

Recipes that failed because of missing data

Recipes that failed because they used too much memory (probably due to https://github.com/SciTools/iris/issues/5338)

bouweandela commented 1 year ago

And here are the results of the comparison with the results for v2.8.0. 61 recipes have different results, which seems like a lot. I will have a preliminary look at them and report some results here later.

valeriupredoi commented 1 year ago

61 is not a lot - it's in fact bang-on, I think: I had some 50-60 myself for 2.7 and Remi had a bucketload too. Matplotlib changed, so expect plots differing only. Great work, bud :beers:

valeriupredoi commented 1 year ago

well, rainfarm died bc you didn't install Julia; where are @schlunma's schlund*gpp ones that failed?

valeriupredoi commented 1 year ago

carvalhais looks like a proper bug in iris - cube.core_data() when data is just a point, I'll have a closer look

valeriupredoi commented 1 year ago

Both lauer22 didn't actually exit - they seem to have been killed by SLURM mid-analysis, either node failure or memory-related? Same with mpqb and sea surface salinity = gonna update your list there

remi-kazeroni commented 1 year ago

Great work with this recipe testing, @bouweandela!

Both lauer22 didn't actually exit - they seem to have been killed by SLURM mid-analysis, either node failure or memory-related? Same with mpqb and sea surface salinity = gonna update your list there

Indeed, we have 7 recipes (last section of this https://github.com/ESMValGroup/ESMValTool/issues/3239#issuecomment-1602719758) that couldn't be run due to time/memory limitations, even though they ran fine in the previous release. Possible reasons:

And 4-5 diagnostic failures that we should try to fix together next week 🍻

valeriupredoi commented 1 year ago

so @ledm has just reported to CEDA/JASMIN that a bunch of his runs get kicked out of SLURM on basis of resources but obv nothing out of the ordinary (ie mem consumption order 2GB) - he's still to confirm that he's using the latest Core, but the symptoms look identical to these recipes here that didn't exit, but rather, were kicked out by the scheduler

valeriupredoi commented 1 year ago

carvalhais looks like a proper bug in iris - cube.core_data() when data is just a point, I'll have a closer look

lemme look into this now :portugal:

valeriupredoi commented 1 year ago

OK @ledm is running with latest:

esmvalcore                2.9.0.dev28+g179312e41          pypi_0    pypi
esmvaltool                2.9.0.dev32+ge17d1c8f6          pypi_0    pypi

and is seeing exactly what Bouwe saw for lauer22 etc - bizarre kickouts

ledm commented 1 year ago

Okay, glad it's not just me. Though I think that some of my recent issues originated with https://github.com/ESMValGroup/ESMValTool/issues/3240 and https://github.com/ESMValGroup/ESMValCore/pull/2110

ledm commented 1 year ago

The file that gets killed immediately every time on JASMIN for me is: https://github.com/ESMValGroup/ESMValTool/blob/09c18280751323354b4f0aa1f0ef6e260589326d/esmvaltool/recipes/ai_mpa/recipe_ocean_ai_mpa_po4.yml

valeriupredoi commented 1 year ago

@bouweandela carvalhais goes through fine with iris=3.6.0 (deployed) and 3.6.x (pip-installed) - that's either a fluke or something related to distributed (which I did not run with); same dask as yours too; try rerunning the recipe, methinks

ledm commented 1 year ago

I'm still struggling with this. I've changed the preprocessing stage of the recipe (removing custom order), and I've tried a few older versions of esmvalcore (2.8.0, 2.8.1, and 2.8.1rc1). No matter what, I'm still getting the "killed" message after 0.3 GB of RAM usage.

bouweandela commented 1 year ago

@ledm Would it be OK to open a separate issue for your recipe? This issue is about testing recipes for the upcoming v2.9 release. Re the recipes listed above that are running out of memory: this is probably due to scitools/iris#5338.

ledm commented 1 year ago

Okay, I'm able to recreate the error! It looks like this may be something else: it seems to be an extract_levels preprocessor bug.

Running interactively on jasmin:

import iris
from esmvalcore.preprocessor._regrid import extract_levels

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"

cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])

Results in Killed; I'm 3 for 3 now on esmvalcore v2.8.1, so there's that! I'll try other schemes and see if they are more resilient.

Let me know if this needs to go into its own issue page!

Hope you all have lots of fun this week in Sweden!

Edit: This is killed for all four schemes and different levels:

c2 = extract_levels(cube, scheme='nearest', levels=[0.1])              # killed
c2 = extract_levels(cube, scheme='nearest', levels=[0.5])              # killed
c2 = extract_levels(cube, scheme='linear', levels=[0.5])               # killed
c2 = extract_levels(cube, scheme='nearest_extrapolate', levels=[0.5])  # killed
c2 = extract_levels(cube, scheme='linear_extrapolate', levels=[0.5])   # killed

Edit2: The error occurs here: https://github.com/ESMValGroup/ESMValCore/blob/1101d36e3f343ec823842ea7c3f4b941ee942a89/esmvalcore/preprocessor/_regrid.py#L870

    # Now perform the actual vertical interpolation.
    new_data = stratify.interpolate(levels,
                                    src_levels_broadcast,
                                    cube.core_data(),
                                    axis=z_axis,
                                    interpolation=interpolation,
                                    extrapolation=extrapolation)

I'm running stratify version '0.3.0'. Is that correct? Who maintains stratify?
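For context, the operation at that call site is per-column vertical interpolation; a toy numpy sketch with made-up values (linear scheme only, not stratify's actual compiled implementation):

```python
import numpy as np

# Hypothetical single column: data defined on 4 source depth levels.
src_levels = np.array([0.0, 10.0, 50.0, 100.0])
column = np.array([20.0, 18.0, 10.0, 4.0])
target_levels = [0.1, 0.5]

# Linear interpolation onto the requested levels; stratify does this
# for every column of a whole 4-D cube at once.
interpolated = np.interp(target_levels, src_levels, column)
print(interpolated)  # -> approximately [19.98, 19.9]
```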

ledm commented 1 year ago

Looks like stratify is one of @bjlittle's projects? Any idea what could be causing this bug, Bill?

ledm commented 1 year ago

Ooof, turns out stratify is a mix of C and Python, and I'm not capable of debugging that!

Think my backup solution is to go back to this solution.

https://github.com/ESMValGroup/ESMValCore/issues/1039

bouweandela commented 1 year ago

Let me know if this needs to go into its own issue page!

Yes, please

ledm commented 1 year ago

Done: https://github.com/ESMValGroup/ESMValTool/issues/3244

bouweandela commented 1 year ago

I ran the comparison tool again because there was an issue with it. NaN values did not compare as equal in arrays, so the tool flagged too many recipe runs as changed. The new comparison results are available here. Now the number of recipe runs that are different from v2.8.0 is 61. I will add a more detailed comparison once it is ready.
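The pitfall here is the standard one with elementwise float equality; a minimal numpy illustration (not the comparison tool's actual code):

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = np.array([1.0, np.nan, 3.0])

# Plain equality treats NaN != NaN, so identical arrays compare as
# different and recipe runs get flagged as changed:
print(np.array_equal(a, b))                  # False
# Counting NaNs in the same positions as equal removes the false positives:
print(np.array_equal(a, b, equal_nan=True))  # True
```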

bouweandela commented 1 year ago

Detailed comparison, including all kinds of information on what exactly changed in the output is available here. Please scroll through the document to find your recipe. The indentation indicates what recipe/variable/coordinate the information applies to.

valeriupredoi commented 1 year ago

@bouweandela carvalhais goes through fine with iris=3.6.0 (deployed) and 3.6.x (pip-installed) - that's either a fluke or something related to distributed (which I did not run with); same dask as yours too; try rerunning the recipe, methinks

@bouweandela have you rerun this recipe to have it finish OK, by any chance?

remi-kazeroni commented 1 year ago

@bouweandela carvalhais goes through fine with iris=3.6.0 (deployed) and 3.6.x (pip-installed) - that's either a fluke or something related to distributed (which I did not run with); same dask as yours too; try rerunning the recipe, methinks

@bouweandela have you rerun this recipe to have it finish OK, by any chance?

Just did with iris 3.6.1 and I'm still getting the same error.

remi-kazeroni commented 1 year ago

By increasing the resources (from the interactive to the compute partition, or from compute to compute 512G), I was able to successfully run a few more recipes: recipe_climate_change_hotspot, recipe_mpqb_xch4, recipe_sea_surface_salinity, recipe_thermodyn_diagtool. But these recipes have longer runtimes and higher memory consumption than found during the v2.8 release (factor > 2). I could run these recipes successfully before and after the latest iris PR was merged without any differences in terms of performance.

schlunma commented 1 year ago

By increasing the resources (from the interactive to the compute partition, or from compute to compute 512G), I was able to successfully run a few more recipes: recipe_climate_change_hotspot, recipe_mpqb_xch4, recipe_sea_surface_salinity, recipe_thermodyn_diagtool. But these recipes have longer runtimes and higher memory consumption than found during the v2.8 release (factor > 2). I could run these recipes successfully before and after the latest iris PR was merged without any differences in terms of performance.

I re-run some of these recipes with a distributed scheduler using the following .esmvaltool/dask.yml file:

cluster:
  type: distributed.LocalCluster
  n_workers: 8
  threads_per_worker: 4
  memory_limit: 32 GiB

(no idea how optimal those settings are).

The runs that failed exited with errors like concurrent.futures._base.CancelledError: compute_and_return_warnings-3ee342d0-affa-4e69-9f2a-09fc9e7cbe92 during the save step. I also see a lot of warnings like Sending large graph of size ... or full garbage collections took XX% CPU time recently in these runs, so my guess is that some of the preprocessors used in these recipes are not fully lazy yet.

Bottom line: I think for the fully lazy preprocessors we can achieve a similar (hopefully much better, with smarter configuration) run time and memory usage when using Dask Distributed, but the non-lazy preprocessors crash with it.
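The lazy-versus-realized distinction can be illustrated with plain dask arrays (a toy sketch, not an actual preprocessor):

```python
import dask.array as da
import numpy as np

# Fully lazy pipeline: only a task graph is built, nothing is held in
# memory yet, and a distributed scheduler can stream it chunk by chunk.
arr = da.random.random((2000, 2000), chunks=(500, 500))
anomaly_sum = (arr - arr.mean(axis=0)).sum()

# A non-lazy step realizes the whole array at once; for real-size cubes
# this is the kind of operation that exhausts worker memory.
realized = np.asarray(arr)
```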

bouweandela commented 1 year ago

I tested recipe_daily_era5.yml with both iris 3.4 and 3.6.1 without configuring Dask Distributed: it runs fine with iris 3.4 and runs out of memory with 3.6.1.

valeriupredoi commented 1 year ago

carvalhais goes through with dask=2023.3.0 but fails a couple lines down the diag at plotting:

Traceback (most recent call last):
  File "/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py", line 819, in <module>
    main(config)
  File "/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py", line 757, in main
    _plot_single_map(plot_path_mod, tau_ctotal,
  File "/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py", line 654, in _plot_single_map
    plt.imshow(_get_data_to_plot(_dat.data),
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/matplotlib/pyplot.py", line 2695, in imshow
    __ret = gca().imshow(
            ^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/mpl/geoaxes.py", line 318, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/mpl/geoaxes.py", line 1331, in imshow
    img, extent = warp_array(img,
                  ^^^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/img_transform.py", line 192, in warp_array
    array = regrid(array, source_native_xy[0], source_native_xy[1],
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/img_transform.py", line 278, in regrid
    _, indices = kdtree.query(target_xyz, k=1)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "_ckdtree.pyx", line 795, in scipy.spatial._ckdtree.cKDTree.query
ValueError: 'x' must be finite, check for nan or inf values

so cartopy is breaking it now

valeriupredoi commented 1 year ago

the problem with downgrading dask is that we actually have to isolate a bug at their end, but beats me if I could do that - what I did was save the scalar cube that was causing the issue and try to repeat the operation that dask did not like:

import iris

def make_cube():
    """Load the saved scalar cube that caused the issue."""
    cube = iris.load_cube("tauxxx.nc")
    return cube

def main():
    """Run the show."""
    cube = make_cube()
    cube.data = cube.lazy_data()
    print(cube.has_lazy_data())
    print(cube)
    x = float(cube.core_data())
    print(x)

if __name__ == "__main__":
    main()

problem is, that works well, no issue. Note that I needed to force the lazy data because the cube in the diag has indeed lazy data, whereas if you load an already-scalar cube it has realized data. Am stumped :confounded:

valeriupredoi commented 1 year ago

carvalhais goes through with dask=2023.3.0 but fails a couple lines down the diag at plotting:


so cartopy is breaking it now

this is signaled and raised here https://github.com/SciTools/cartopy/issues/2199 and it looks like they already have a PR to fix it :partying_face:

bouweandela commented 1 year ago

Summary of relevant differences:

Recipes where the plots are identical, but (some) data differs

For the above recipes, I suspect we might want to make the data comparison more tolerant, since the plots are considered identical enough.

The recipes below need a check by the @ESMValGroup/esmvaltool-recipe-maintainers (please let us know if you are no longer interested in being a maintainer, e.g. by commenting in this issue):

Recipes where (some) plots differ, but not the data

(Some) plots and data are different

The results from running the recipes with v2.8.0 of the tool are available here and the results from running them with v2.9.0 of the tool are available here.

You can see very specifically what is different in the comparison tool output files: detailed comparison part 1 and detailed comparison part 2.

bouweandela commented 1 year ago

@ESMValGroup/technical-lead-development-team Because we're seeing a considerable performance regression for some recipes when using the basic/default Dask scheduler, I would propose that we make the release with iris pinned to v3.4 or later instead of v3.6 or later. That would allow users who are really keen on using the basic scheduler to do so. Please let me know if there are objections to that.
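If adopted, the proposal amounts to relaxing the dependency pin; a hedged sketch of the relevant environment.yml line (exact placement in the file hypothetical):

```yaml
dependencies:
  # allow falling back to iris 3.4/3.5 for basic-scheduler performance
  - iris >=3.4
```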

ledm commented 1 year ago

Out of interest, what was the last ESMValTool version that uses iris 3.4?

remi-kazeroni commented 1 year ago

Out of interest, what was the last ESMValTool version that uses iris 3.4?

For the release of v2.8.0, we used iris-3.4.1 to test all recipes

katjaweigel commented 1 year ago

@bouweandela thanks for the list! The plot from recipe_spei.yml only changed in colour (because R chooses different colours every time; I should set fixed ones there, but I guess there is not enough time to change that before the release now). For recipe_flato13ipcc_figures_938_941_cmip3.yml, recipe_flato13ipcc_figures_938_941_cmip6.yml and recipe_weigel21gmd_figures_13_16.yml there is a somewhat different handling of the fx files now (see #3215), which explains the changed values.

bouweandela commented 1 year ago

Thanks for checking @katjaweigel! I was planning to make the release on Thursday next week, so if you still want to fix the plotting in recipe_spei.yml before that time you can do it.

remi-kazeroni commented 1 year ago

Regarding the list of "Recipes that failed because they used too much memory" (in this https://github.com/ESMValGroup/ESMValTool/issues/3239#issuecomment-1602719758), I was able to run all but one of them successfully on Levante by increasing the computational resources requested (see #3216). This does not really tell us where the problems come from, but it will nevertheless help the RM, who will see fewer failures.

The biggest increase of resources required is for the 2 cloud recipes. Using 1024 GB compute nodes and max_parallel_tasks=3, I was able to run recipe_lauer22jclim_fig5_lifrac (Time for running the recipe was: 5:20:57.188249, Maximum memory used (estimate): 881.6 GB) but not recipe_lauer22jclim_fig3-4_zonal (ends with 1 tasks running, 1 tasks waiting for ancestors, 14/16 done after the 8h limit). FYI, @axel-lauer.

bouweandela commented 1 year ago

We did another round of test runs. The results are available here: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.9.0/

Here is the environment.yml and here are the comparison results and the detailed comparison results. The comparison was done with the previous round of testing for this release. Because of disk space limitations, I had to remove the v2.9.0rc1 results from the webpage, but the results are still available in /work/bd0854/b381141/esmvaltool_output/esmvaltool-v2.9.x-2023-06-20 for those who have access to Levante.

All recipes ran successfully except for:

valeriupredoi commented 1 year ago

@bouweandela here's your Portuguese issue, bud https://github.com/ESMValGroup/ESMValTool/issues/3281

valeriupredoi commented 1 year ago

I think we can safely close this now that we have #3287 - @bouweandela @remi-kazeroni since you put so much effort in it, I'd feel bad closing it myself, will you do the service pls :medal_military:

maritsandstad commented 1 year ago

Hi, I've been away on vacation and didn't see the mention to check differing plots until now. To me it looks like only consecutive dry days have changed, and I wonder if the computation has simply switched from the version that caps the count at the end of a year to the one that runs over multiple years when the longest dry period extends across calendar years. That would explain the plot discrepancy and the weird end of the original plot (in 2.8), and it would anyway be the more sensible thing to plot. I.e. I think this is an improvement, then.