ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

Recipe test results for v2.10 #3463

Closed bouweandela closed 6 months ago

bouweandela commented 7 months ago

Recipe test results for v2.10

Here is an overview of the tests done for releasing v2.10. The results are available in https://esmvaltool.dkrz.de/shared/esmvaltool/v2.10.0/debug.html.

Here is the conda environment.yml.

Recipe running session 2023-12-12

Recipes that ran successfully (133 out of 155)

Recipes that failed because the diagnostic script failed (4 out of 155)

Recipes that failed because of missing data (4 out of 155)

Recipes that failed because the run took too long (8 out of 155)

Recipes that failed because they used too much memory (4 out of 155)

Recipes that failed because of an HDF5 error (3 out of 155)

bouweandela commented 7 months ago

Unfortunately, the results were written to a scratch disk and left there for too long, so part of them was deleted. We will need to do a new run, so this will take a bit longer.

valeriupredoi commented 7 months ago

great stuff, folks! Here's me looking at the 17-odd ones that died of unnatural causes, for various reasons (not data or diag):

schlunma commented 7 months ago

recipe_psyplot fails due to an issue in geos, which will apparently be fixed in their new version 3.13 (we currently have 3.11).

valeriupredoi commented 7 months ago

good find @schlunma :beer: Our env is getting rather old, we need that Py312 support sooner rather than later (working on it for Core, stuck at prospector, the last hurdle)

valeriupredoi commented 7 months ago

lauer22_Fig5_lifrac fails to realize data from a (240, 91, 360, 720) array with dtype('float64') - needs to be run on a hefty memory node
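
For a sense of scale, here is a quick back-of-the-envelope calculation of how much memory realizing that array takes. It only uses the shape and dtype quoted above; any intermediate copies made during processing would come on top of this.

```python
import math

import numpy as np

# Shape and dtype as reported in the failure message above.
shape = (240, 91, 360, 720)
dtype = np.dtype("float64")

n_elements = math.prod(shape)
n_bytes = n_elements * dtype.itemsize  # 8 bytes per float64 value

print(f"{n_elements:,} elements")
print(f"{n_bytes / 1e9:.1f} GB ({n_bytes / 2**30:.1f} GiB)")
# 5,660,928,000 elements
# 45.3 GB (42.2 GiB)
```

So a single in-memory copy already needs roughly 45 GB, which is consistent with the recipe needing a large-memory node.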

valeriupredoi commented 7 months ago

a lot of those 17 recipes that failed for other reasons simply die out - all goes fine until they just stop in their tracks - any info from the SLURM logs? It looks to me like they were just run on an interactive node and the user timed out, with the system killing the session and, implicitly, the running process

schlunma commented 7 months ago

I just successfully ran my 3 schlundjgr recipes, updated them in the table.

Regarding the NCL failures: Could you please cherry-pick https://github.com/ESMValGroup/ESMValTool/commit/1647d46ee3c5deb084fa4ac59024a46581d618c3 into the release branch? This needs to be in there (see https://github.com/ESMValGroup/ESMValTool/issues/3420). I just ran wenzel16jclim successfully with the current main branch.

bouweandela commented 7 months ago

> Regarding the NCL failures: Could you please cherry-pick https://github.com/ESMValGroup/ESMValTool/commit/1647d46ee3c5deb084fa4ac59024a46581d618c3 into the release branch? This needs to be in there (see https://github.com/ESMValGroup/ESMValTool/issues/3420). I just ran wenzel16jclim successfully with the current main branch.

@schlunma I pulled this in, but it is giving me these issues:

$ cat /home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_tebaldi21esd_20231213_193814/run/fig1b/plot_ts_line_mean_spread_pr/log.txt
 Copyright (C) 1995-2019 - All Rights Reserved
 University Corporation for Atmospheric Research
 NCAR Command Language Version 6.6.2
 The use of this software is governed by a License Agreement.
 See http://www.ncl.ucar.edu/ for more details.
INFO    Loading settings from /home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_tebaldi21esd_20231213_193814/run/fig1b/plot_ts_line_mean_spread_pr/settings.ncl
INFO    Loading input data description from /home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_tebaldi21esd_20231213_193814/preproc/fig1b/pr/pr_info.ncl
INFO     Wrote /home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_tebaldi21esd_20231213_193814/plots/fig1b/plot_ts_line_mean_spread_pr//pr_ts_line_1850_2100.pdf
INFO    fatal: in log_provenance (interface_scripts/logging.ncl), outfile (path to figure) '/home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_tebaldi21esd_20231213_193814/plots/fig1b/plot_ts_line_mean_spread_pr//pr_ts_line_1850_2100.pdf' does not exist (for PNGs, this function also searches for 'FILE.000001.png', 'FILE.000002.png', etc.); if no plot file is available use 'n/a'

and

$ cat /home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_collins13ipcc_20231213_193815/run/ts_line_tas/ch12_plot_ts_line_mean_spread_tas/log.txt
 Copyright (C) 1995-2019 - All Rights Reserved
 University Corporation for Atmospheric Research
 NCAR Command Language Version 6.6.2
 The use of this software is governed by a License Agreement.
 See http://www.ncl.ucar.edu/ for more details.
INFO    Loading settings from /home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_collins13ipcc_20231213_193815/run/ts_line_tas/ch12_plot_ts_line_mean_spread_tas/settings.ncl
INFO    Loading input data description from /home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_collins13ipcc_20231213_193815/preproc/ts_line_tas/tas/tas_info.ncl
INFO     Wrote /home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_collins13ipcc_20231213_193815/plots/ts_line_tas/ch12_plot_ts_line_mean_spread_tas//tas_ts_line_1850_2300.pdf
INFO    fatal: in log_provenance (interface_scripts/logging.ncl), outfile (path to figure) '/home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_collins13ipcc_20231213_193815/plots/ts_line_tas/ch12_plot_ts_line_mean_spread_tas//tas_ts_line_1850_2300.pdf' does not exist (for PNGs, this function also searches for 'FILE.000001.png', 'FILE.000002.png', etc.); if no plot file is available use 'n/a'

with output_file_type: png in config-user.yml. The files with .png extension do exist, but somehow the code appears to be looking for pdf files?
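
The error message above describes how the provenance check looks for the plot file. A rough Python sketch of that kind of lookup may help to see why a `.pdf` path fails when only `.png` files were written. The real check is NCL code in interface_scripts/logging.ncl; the helper name `find_plot_files` and the six-digit numbering pattern are assumptions based only on the wording of the message.

```python
import glob
import os
import tempfile


def find_plot_files(outfile):
    """Return the existing plot files for ``outfile`` (hypothetical helper)."""
    if os.path.isfile(outfile):
        return [outfile]
    stem, ext = os.path.splitext(outfile)
    if ext.lower() == ".png":
        # Assumed numbering scheme: FILE.000001.png, FILE.000002.png, ...
        digits = "[0-9]" * 6
        pattern = f"{glob.escape(stem)}.{digits}.png"
        return sorted(glob.glob(pattern))
    return []


with tempfile.TemporaryDirectory() as tmp:
    # Only a PNG is written, as with output_file_type: png.
    open(os.path.join(tmp, "pr_ts_line_1850_2100.png"), "w").close()
    found_png = find_plot_files(os.path.join(tmp, "pr_ts_line_1850_2100.png"))
    found_pdf = find_plot_files(os.path.join(tmp, "pr_ts_line_1850_2100.pdf"))

print(len(found_png), len(found_pdf))
```

In this sketch the PNG lookup succeeds and the PDF lookup finds nothing, which mirrors the reported symptom: the plot exists, but the path handed to the check has the wrong extension.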

bouweandela commented 7 months ago

@axel-lauer Could you please copy the file /work/bd0854/DATA/ESMValTool2/download/obs4MIPs/MODIS-1-0/v20180305/clt_mon_MODIS-1-0_BE_gn_200003-201109.nc to /work/bd0854/DATA/ESMValTool2/OBS/Tier1/MODIS-1-0/ on Levante? That will make it possible to run recipe_clouds_bias.yml and recipe_lauer13jclim again. Unfortunately, fully automatic download from ESGF does not work because the file has outdated facets on ESGF.

zklaus commented 7 months ago

The remaining NCL problems are due to a double guessing of the output filename. I think I fixed it locally, will commit soon.

zklaus commented 7 months ago

The NCL fix is in #3474.

valeriupredoi commented 7 months ago

can one of you pls have a look at Julia? I want to close (and will close) https://github.com/ESMValGroup/ESMValTool/issues/3287 since the Julia problem is the only thing outstanding there (even so, it has its own issue)

bouweandela commented 7 months ago

The issue with recipe_julia.yml is still open, I suspect it doesn't correctly handle fill values since the scale is at 1e20.

valeriupredoi commented 7 months ago

lemme have a look at Julia then :grin:

valeriupredoi commented 7 months ago

ahaa! Figured out Julia! netCDF package loads missing values as 1e20s, whereas NCDatasets loads them as missing - will open Draft PR (I don't speak Julia so I couldn't make it work 100%)
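
To illustrate why unmasked fill values push the scale to 1e20, here is a small NumPy analogy. This is not the Julia fix itself, only a sketch of the same effect in Python; the sample temperatures are made up for illustration.

```python
import numpy as np

FILL_VALUE = 1e20  # fill value mentioned in this thread

# Made-up sample data with one missing point stored as the fill value.
data = np.array([271.3, 280.1, FILL_VALUE, 275.6])

# Reading the fill value as ordinary data: it dominates the range.
raw_max = data.max()

# Treating the fill value as missing: the range reflects the real data.
masked = np.ma.masked_values(data, FILL_VALUE)
masked_max = masked.max()

print(raw_max, masked_max, masked.count())
```

A colour scale derived from `raw_max` would stretch to 1e20, while one derived from `masked_max` stays within the physical range.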

bouweandela commented 7 months ago

Recipe running session 2023-12-14

I did a re-run of all recipes affected by late changes (i.e. all NCL recipes and a few other recipes with bug/data fixes).

Recipes that ran successfully (53 out of 155)

Recipes that failed because the diagnostic script failed (1 out of 155)

Recipes that failed because the run took too long (8 out of 155)

Recipes that failed because they used too much memory (1 out of 155)

Recipes that failed with HDF5 errors (3 out of 155)

axel-lauer commented 7 months ago

> @axel-lauer Could you please copy the file /work/bd0854/DATA/ESMValTool2/download/obs4MIPs/MODIS-1-0/v20180305/clt_mon_MODIS-1-0_BE_gn_200003-201109.nc to /work/bd0854/DATA/ESMValTool2/OBS/Tier1/MODIS-1-0/ on Levante?

Done.

bouweandela commented 7 months ago

Some conclusions based on the above:

schlunma commented 7 months ago

Is the output of these tests available somewhere?

zklaus commented 7 months ago

Hm. Yesterday I successfully ran two of the recipes with HDF5 problems (collins and tebaldi). I also successfully ran wenzel16jclim, though wenzel14jgr continues to fail for me as well.

bouweandela commented 7 months ago

I have not uploaded it yet, but you can access it on Levante:

slurm logs: /home/b/b381141/esmvaltool-v2.10.x-2023-12-14-logs
esmvaltool output: /work/bd0854/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12

bouweandela commented 7 months ago

Here is the HDF error if anyone is interested:

stdout:

OSError: [Errno -101] NetCDF: HDF error: '/home/b/b381141/esmvaltool_output/esmvaltool-v2.10.x-2023-12-12/recipe_tebaldi21esd_20231214_174448/preproc/fig6c_IAV/tas/CMIP6_MRI-ESM2-0_Amon_piControl_r1i1p1f1_tas_gn_1850-2150.nc'

stderr:

  #000: H5F.c line 836 in H5Fopen(): unable to synchronously open file
    major: File accessibility
    minor: Unable to open file
  #001: H5F.c line 796 in H5F__open_api_common(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #002: H5VLcallback.c line 3863 in H5VL_file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #003: H5VLcallback.c line 3675 in H5VL__file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #004: H5VLnative_file.c line 128 in H5VL__native_file_open(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #005: H5Fint.c line 1873 in H5F_open(): unable to lock the file
    major: File accessibility
    minor: Unable to lock file
  #006: H5FD.c line 2034 in H5FD_lock(): driver lock request failed
    major: Virtual File Layer
    minor: Unable to lock file
  #007: H5FDsec2.c line 988 in H5FD__sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'
    major: Virtual File Layer
    minor: Unable to lock file

schlunma commented 7 months ago

I messed up the check for file existence in https://github.com/ESMValGroup/ESMValTool/pull/3422, which gives these wrong NCL errors about non-existing plot files. Should be fixed by #3477 (I tested this successfully with wenzel14jgr, but since the other NCL errors are similar I guess those should run, too :crossed_fingers: )

bouweandela commented 7 months ago

Thanks, running them again now..

valeriupredoi commented 7 months ago

> H5FDsec2.c line 988 in H5FD__sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'

there is I/O toestepping going on - this is a fairly common HDF5 barf if the file is already opened by a process while another process is trying to open and read it/write to it - see eg https://github.com/h5py/h5py/issues/1066 and a bunch of other people complaining about it since years ago. We need to understand what process opens the file and what other process is trying to do the same, but I think that's probably just on SLURM and that's not gonna be a straightforward task. At least, I'd be up for it but not today, and not before Xmas :christmas_tree:
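
For anyone wanting to see this kind of lock conflict in isolation, here is a minimal sketch on Linux using `fcntl.flock` from the Python standard library. It assumes HDF5 takes a non-blocking advisory lock in a broadly similar way; that is an assumption on my part, not something taken from the HDF5 sources. The helper name `second_lock_errno` is made up for illustration.

```python
import errno
import fcntl
import tempfile


def second_lock_errno(path):
    """Hold an exclusive lock on ``path`` and try to take a second one.

    Returns the errno of the failed second attempt, or None if it succeeded.
    """
    with open(path, "rb") as first, open(path, "rb") as second:
        fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            # Same file, different open file description: this conflicts.
            fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return exc.errno
        return None


with tempfile.NamedTemporaryFile() as tmp:
    lock_errno = second_lock_errno(tmp.name)

print(lock_errno, errno.errorcode.get(lock_errno))
```

On Linux the second attempt fails with EAGAIN, which is errno 11, 'Resource temporarily unavailable', the same code as in the traceback. HDF5 also has an `HDF5_USE_FILE_LOCKING` environment variable that is often mentioned as a workaround, but whether disabling locking is safe for these recipes is not something this thread establishes.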

schlunma commented 7 months ago

Good news, all NCL recipes with diag failures except for recipe_russel18jgr run now :tada:

recipe_russell18jgr fails since some of the diagnostics don't write plots (but judging from the code they are supposed to do that). This has already been the case for the v2.9.0 release, but became evident now due to the changes in the NCL provenance code. I opened an issue here: https://github.com/ESMValGroup/ESMValTool/issues/3478

Since both maintainers of this recipe are not really active anymore, I suggest we flag this recipe as broken. @ESMValGroup/esmvaltool-coreteam opinions?

katjaweigel commented 7 months ago

It would be good to get it running again, but I guess it is not worth it or possible for the current release (since the missing figures went unnoticed/unreported for quite some time now).

schlunma commented 7 months ago

Here is a PR that fixes some issues in the russell recipe: https://github.com/ESMValGroup/ESMValTool/pull/3479. With this, all diagnostics except for the ones listed in #3478 work again.

bouweandela commented 7 months ago

Here are the results from the comparison with v2.9.

bouweandela commented 7 months ago

Here is a summary of the comparison results (full comparison is here). @ESMValGroup/esmvaltool-recipe-maintainers and @ESMValGroup/esmvaltool-coreteam If you have a bit of time, please check if the output of these recipes is still correct. Tick the box and add your name behind a recipe once you've checked.

Runs with v2.10: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.10.0/
Runs with v2.9: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.9.0/
Runs with v2.8: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.8.0/

If the plots or data files are not shown on the recipe output webpage (this happens when provenance has not been implemented in the diagnostic script), you can still download them by clicking the 'figures' or 'data' links at the bottom of the page.

The recipes where plots are different are probably the most important to check because if the data are different but the plots still look the same the changes are probably not significant. Maybe we can refine the thresholds for when data is reported as different for a future version of the comparison tool.

Plots and data are different

Only plots are different

Only data are different

Comparison is done using numpy.allclose with the default tolerances for floating point numbers and numpy.array_equal for other data types.
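
As a rough illustration of what that comparison rule means in practice, here is a minimal sketch. It is not the comparison tool's actual code; the function name `arrays_match` and the explicit shape check are assumptions made for this example.

```python
import numpy as np


def arrays_match(reference, current):
    """Compare two arrays with the rule described above (hypothetical helper)."""
    reference = np.asarray(reference)
    current = np.asarray(current)
    # Assumption: differing shapes count as different (allclose would broadcast).
    if reference.shape != current.shape:
        return False
    both_float = np.issubdtype(reference.dtype, np.floating) and np.issubdtype(
        current.dtype, np.floating
    )
    if both_float:
        # Default tolerances: rtol=1e-05, atol=1e-08.
        return bool(np.allclose(reference, current))
    return bool(np.array_equal(reference, current))


print(arrays_match([1.0, 2.0], [1.0, 2.0 + 1e-9]))  # within tolerance
print(arrays_match([1.0, 2.0], [1.0, 2.001]))       # outside tolerance
print(arrays_match([1, 2], [1, 3]))                 # integers: exact match only
```

Note that with the defaults `numpy.allclose` treats NaN as unequal to NaN, so data containing NaNs at the same positions would still be reported as different unless `equal_nan=True` is passed.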

Results are the same as v2.9

Unable to compare because no reference run for v2.9

katjaweigel commented 6 months ago

Is it possible to see the result/log of the comparison tool? Visually I can't find any difference in the figures of recipe_martin18grl.yml, but it has a lot of figures, so I might have missed it.

bouweandela commented 6 months ago

Yes, they are posted in https://github.com/ESMValGroup/ESMValTool/issues/3463#issuecomment-1859257090.

recipe_martin18grl.yml: results differ from reference run
Reference run: /shared/esmvaltool/v2.9.0/recipe_martin18grl_20230704_162537
Current run: /shared/esmvaltool/v2.10.0/recipe_martin18grl_20231212_223414
Differing files:
  - plots/spi_collect/spi_collect/SPI_mapHistoric_Dur_of_Events_ACCESS1-0.png
  - plots/spi_collect/spi_collect/SPI_mapHistoric_Dur_of_Events_IPSL-CM5A-MR.png
  - plots/spi_collect/spi_collect/SPI_mapHistoric_Dur_of_Events_MPI-ESM-MR.png
  - plots/spi_collect/spi_collect/SPI_mapHistoric_No_of_Events_per_year_Observations.png
  - plots/spi_collect/spi_collect/SPI_mapHistoric_Sev_index_of_Events_GFDL-ESM2G.png
  - plots/spi_collect/spi_collect/SPI_mapObservations_Average_SPI_of_Events_Mean.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Avr_SPI_of_Events_GISS-E2-H.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Avr_SPI_of_Events_IPSL-CM5B-LR.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Dur_of_Events_CNRM-CM5.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Dur_of_Events_GFDL-ESM2G.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Dur_of_Events_HadGEM2-CC.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Dur_of_Events_IPSL-CM5A-LR.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Dur_of_Events_MPI-ESM-MR.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Dur_of_Events_MRI-ESM1.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Sev_index_of_Events_GFDL-ESM2G.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Sev_index_of_Events_IPSL-CM5B-LR.png
  - plots/spi_collect2/spi_collect2/SPI_mapFuture_Sev_index_of_Events_MRI-ESM1.png
  - plots/spi_collect2/spi_collect2/SPI_mapHistoric_Dur_of_Events_GFDL-ESM2G.png
  - plots/spi_collect2/spi_collect2/SPI_mapHistoric_Dur_of_Events_GISS-E2-H.png
  - plots/spi_collect2/spi_collect2/SPI_mapHistoric_Sev_index_of_Events_GFDL-CM3.png
  - plots/spi_collect2/spi_collect2/SPI_mapHistoric_Sev_index_of_Events_MRI-ESM1.png
  - plots/spi_collect2/spi_collect2/SPI_mapHistoric_Sev_index_of_Events_NorESM1-M.png

Thanks for checking!

katjaweigel commented 6 months ago

Thanks! (And sorry that I didn't think to look at the earlier post.) The differences are tiny deviations in the way missing data are masked on the plot (really only visible if you switch between the two versions of these figures).

valeriupredoi commented 6 months ago

autoassess, validation, and all @ledm 's oceans eleven look fine! Stellar work @bouweandela :beer:

bouweandela commented 6 months ago

Thanks, everyone! The release has now been published!

valeriupredoi commented 6 months ago

It's a Christmas miracle 🎄 🎅