ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

[v2.11.0 release] Recipe failures on DKRZ: `OSError: [Errno -101] NetCDF: HDF error` #3702

Open ehogan opened 3 months ago

ehogan commented 3 months ago

Describe the bug

Following ESMValCore v2.11.0rc2 testing, a few recipes are failing with `OSError: [Errno -101] NetCDF: HDF error`:

The first three were failing during ESMValTool v2.10.0 testing. The last recipe started failing during ESMValCore v2.11.0rc1 testing.

I intend to add these to the list of broken recipes via #3662.
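
When triaging many recipe runs, it can help to list which output directories actually contain this error. The following is a minimal sketch, not part of ESMValTool: the function name `find_hdf_failures` is hypothetical, and it assumes each run directory keeps its log at `run/main_log.txt`.

```python
from pathlib import Path

# Error string as it appears in the issue title.
ERROR_TEXT = "NetCDF: HDF error"


def find_hdf_failures(output_root):
    """Return sorted names of run directories whose main log mentions the HDF error.

    Assumes a layout of <output_root>/<run_dir>/run/main_log.txt
    (an assumption for this sketch; adjust the glob pattern if yours differs).
    """
    failures = []
    for log_file in Path(output_root).glob("*/run/main_log.txt"):
        # Replace undecodable bytes rather than failing on a damaged log.
        text = log_file.read_text(encoding="utf-8", errors="replace")
        if ERROR_TEXT in text:
            # parents[1] is the run directory, two levels above the log file.
            failures.append(log_file.parents[1].name)
    return sorted(failures)
```

Calling `find_hdf_failures("/path/to/esmvaltool_output")` (a placeholder path) would return the run directories to check against the list of broken recipes.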

bouweandela commented 3 months ago

@ehogan Have you tried running any of these recipes on a different machine, e.g. JASMIN? If I remember correctly, we previously thought the Levante filesystem caused these issues. I just ran `recipe_preprocessor_derive_test.yml` (except for the cmip6/toz variable, see #3709) on my laptop and it runs without HDF errors.

ehogan commented 3 months ago

> @ehogan Have you tried running any of these recipes on a different machine, e.g. JASMIN? If I remember correctly, we previously thought the Levante filesystem caused these issues. I just ran `recipe_preprocessor_derive_test.yml` (except for the cmip6/toz variable, see #3709) on my laptop and it runs without HDF errors.

I haven't. I have just set up the other three recipes to run on JASMIN, but I won't have the results until tomorrow (given the timings from the v2.9.0 testing):

ehogan commented 3 months ago

I am struggling to get the first two recipes to run on JASMIN (they keep failing with what appear to be various memory-related errors), and the third recipe failed due to missing data, even though I had `search_esgf: when_missing` set in the ESMValTool user configuration file 😞

ehogan commented 3 months ago

Even though there were memory errors, the recipe_collins13ipcc.yml recipe has just completed on JASMIN 🥳

2024-07-03 14:19:22,220 UTC [33611] INFO    Time for running the recipe was: 7:35:57.567925
2024-07-03 14:19:22,518 UTC [33611] INFO    Maximum memory used (estimate): 128.9 GB
[...]
2024-07-03 14:20:02,500 UTC [33611] INFO    Run was successful
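
For comparing runs across machines (for example Levante versus JASMIN), the timing and memory lines in a log like the one above can be extracted with a short script. This is a minimal sketch that assumes the exact log wording shown here; the function name `summarise_run` is hypothetical and not part of ESMValTool.

```python
import re
from datetime import timedelta

# Patterns match the log wording shown in this thread (assumed stable).
TIME_RE = re.compile(
    r"Time for running the recipe was: (\d+):(\d{2}):(\d{2}(?:\.\d+)?)"
)
MEM_RE = re.compile(r"Maximum memory used \(estimate\): ([\d.]+) GB")


def summarise_run(log_text):
    """Extract run time, peak memory in GB, and the success flag from log text.

    Limitation: a run time printed with a leading "N day(s), " part is not
    handled; only the H:MM:SS portion is read.
    """
    summary = {"runtime": None, "max_memory_gb": None, "successful": False}
    for line in log_text.splitlines():
        time_match = TIME_RE.search(line)
        if time_match:
            hours, minutes, seconds = time_match.groups()
            summary["runtime"] = timedelta(
                hours=int(hours), minutes=int(minutes), seconds=float(seconds)
            )
        mem_match = MEM_RE.search(line)
        if mem_match:
            summary["max_memory_gb"] = float(mem_match.group(1))
        if "Run was successful" in line:
            summary["successful"] = True
    return summary
```

Applied to the three log lines above, this should report a run time of roughly seven and a half hours, a peak memory estimate of 128.9 GB, and a successful run.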

ehogan commented 3 months ago

> I am struggling to get the first two recipes to run on JASMIN (they keep failing with what appear to be various memory-related errors), and the third recipe failed due to missing data, even though I had `search_esgf: when_missing` set in the ESMValTool user configuration file 😞

The second recipe also failed due to missing data 😞