ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Recipe test results for ESMValCore v2.11.0rc1 #2421

Closed chrisbillowsMO closed 1 week ago

chrisbillowsMO commented 1 month ago

Recipe test results for v2.11.0rc1

This is the initial output from testing done for releasing ESMValCore v2.11.0rc1. Please see the following comment for our evaluation of the failures.

Recipe running session 2024-05-15

Setup

mamba version

levante5> mamba --version
mamba 1.5.8
conda 24.5.0

ESMValTool version

levante5> esmvaltool version
ESMValCore: 2.11.0rc1
ESMValTool: 2.11.0.dev75+g4734caf5a.d20240515

Recipes that ran successfully (132 out of 160)

- recipe_albedolandcover.yml
- recipe_anav13jclim.yml
- recipe_arctic_ocean.yml
- recipe_autoassess_landsurface_permafrost.yml
- recipe_autoassess_landsurface_soilmoisture.yml
- recipe_autoassess_landsurface_surfrad.yml
- recipe_autoassess_stratosphere.yml
- recipe_bock20jgr_fig_1-4.yml
- recipe_bock20jgr_fig_6-7.yml
- recipe_capacity_factor.yml
- recipe_climate_change_hotspot.yml
- recipe_climwip_brunner2019_med.yml
- recipe_climwip_brunner20esd.yml
- recipe_climwip_test_basic.yml
- recipe_climwip_test_performance_sigma.yml
- recipe_clouds_bias.yml
- recipe_clouds_ipcc.yml
- recipe_cmug_h2o.yml
- recipe_concatenate_exps.yml
- recipe_consecdrydays.yml
- recipe_correlation.yml
- recipe_cox18nature.yml
- recipe_cvdp.yml
- recipe_daily_era5.yml
- recipe_deangelis15nat.yml
- recipe_deangelis15nat_fig1_fast.yml
- recipe_decadal.yml
- recipe_diurnal_temperature_index.yml
- recipe_eady_growth_rate.yml
- recipe_ecs.yml
- recipe_ecs_constraints.yml
- recipe_ecs_scatter.yml
- recipe_ensclus.yml
- recipe_era5-land.yml
- recipe_esacci_lst.yml
- recipe_esacci_oc.yml
- recipe_extract_shape.yml
- recipe_extreme_index.yml
- recipe_eyring06jgr.yml
- recipe_flato13ipcc_figure_914.yml
- recipe_flato13ipcc_figure_924.yml
- recipe_flato13ipcc_figure_942.yml
- recipe_flato13ipcc_figure_945a.yml
- recipe_flato13ipcc_figure_96.yml
- recipe_flato13ipcc_figure_98.yml
- recipe_flato13ipcc_figures_926_927.yml
- recipe_flato13ipcc_figures_92_95.yml
- recipe_flato13ipcc_figures_938_941_cmip3.yml
- recipe_flato13ipcc_figures_938_941_cmip6.yml
- recipe_galytska23jgr.yml
- recipe_gier2020bg.yml
- recipe_globwat.yml
- recipe_heatwaves_coldwaves.yml
- recipe_hydro_forcing.yml
- recipe_hype.yml
- recipe_iht_toa.yml
- recipe_impact.yml
- recipe_ipccwg1ar6ch3_fig_3_42_b.yml
- recipe_ipccwg1ar6ch3_fig_3_43.yml
- recipe_ipccwg1ar6ch3_fig_3_9.yml
- recipe_kcs.yml
- recipe_landcover.yml
- recipe_lauer13jclim.yml
- recipe_lauer22jclim_fig1_clim.yml
- recipe_lauer22jclim_fig1_clim_amip.yml
- recipe_lauer22jclim_fig2_taylor.yml
- recipe_lauer22jclim_fig2_taylor_amip.yml
- recipe_lauer22jclim_fig6_interannual.yml
- recipe_lauer22jclim_fig7_seas.yml
- recipe_lauer22jclim_fig8_dyn.yml
- recipe_lauer22jclim_fig9-11c_pdf.yml
- recipe_li17natcc.yml
- recipe_lisflood.yml
- recipe_marrmot.yml
- recipe_meehl20sciadv.yml
- recipe_model_evaluation_basics.yml
- recipe_model_evaluation_clouds_clim.yml
- recipe_model_evaluation_clouds_cycles.yml
- recipe_model_evaluation_precip_zonal.yml
- recipe_modes_of_variability.yml
- recipe_monitor.yml
- recipe_monitor_with_refs.yml
- recipe_mpqb_xch4.yml
- recipe_multimodel_products.yml
- recipe_my_personal_diagnostic.yml
- recipe_ncl.yml
- recipe_ocean_Landschuetzer2016.yml
- recipe_ocean_amoc.yml
- recipe_ocean_bgc.yml
- recipe_ocean_example.yml
- recipe_ocean_ice_extent.yml
- recipe_ocean_multimap.yml
- recipe_ocean_scalar_fields.yml
- recipe_perfmetrics_CMIP5.yml
- recipe_perfmetrics_CMIP5_4cds.yml
- recipe_perfmetrics_land_CMIP5.yml
- recipe_preprocessor_test.yml
- recipe_psyplot.yml
- recipe_pv_capacity_factor.yml
- recipe_python.yml
- recipe_python_for_CI.yml
- recipe_quantilebias.yml
- recipe_r.yml
- recipe_radiation_budget.yml
- recipe_rainfarm.yml
- recipe_runoff_et.yml
- recipe_russell18jgr.yml
- recipe_schlund20jgr_gpp_abs_rcp85.yml
- recipe_schlund20jgr_gpp_change_1pct.yml
- recipe_schlund20jgr_gpp_change_rcp85.yml
- recipe_sea_surface_salinity.yml
- recipe_seaborn.yml
- recipe_seaice.yml
- recipe_seaice_drift.yml
- recipe_seaice_feedback.yml
- recipe_shapeselect.yml
- recipe_smpi.yml
- recipe_smpi_4cds.yml
- recipe_snowalbedo.yml
- recipe_spei.yml
- recipe_tcr.yml
- recipe_thermodyn_diagtool.yml
- recipe_toymodel.yml
- recipe_validation.yml
- recipe_validation_CMIP6.yml
- recipe_variable_groups.yml
- recipe_weigel21gmd_figures_13_16.yml
- recipe_wenzel14jgr.yml
- recipe_wenzel16nat.yml
- recipe_wflow.yml
- recipe_williams09climdyn_CREM.yml
- recipe_zmnam.yml

Recipes that failed because the diagnostic script failed (11 out of 160)

Recipes that failed because of missing data (3 out of 160)

Recipes that failed because the run took too long (6 out of 160)

Recipes that failed of other reasons or are still running (7 out of 160)

Recipes that are known to be broken (1 out of 160)

chrisbillowsMO commented 1 month ago

Hi @ESMValGroup/technical-lead-development-team @bouweandela @valeriupredoi

Any comments on the following evaluation please? (The original output from running the recipes for the first time is above).

1. R diagnostic failures

The following are R recipes with various errors. Would anyone with R knowledge please take a look?

Each error was one of the following two:

Error in (models_dataset == reference_dataset) && (models_exp == reference_exp) :
  'length = 2' in coercion to 'logical(1)'
                     ^ Operator >remapcon2< not found!

2. Python diagnostic failures

We have the capacity to address these errors - should we? Or does anyone already know how to solve these?

KeyError: 'Provenance record for /scratch/b/b382148/esmvaltool_output/recipe_martin18grl_20240515_142625/plots/spi_collect/spi_collect/SPI_time_series_Bremen_Observations.png already exists.'
iris.exceptions.ConcatenateError: failed to concatenate into a single cube.
  Cube metadata differs for phenomenon: precipitation_flux
TypeError: unhashable type: 'CubeAttrsDict'

3. NCL diagnostic failures

There is one NCL recipe with an error. Would anyone with NCL knowledge please take a look?

INFO    fatal: in uajet_sh850, cannot read plev and latrange

4. Recipes that failed because of missing data

We recognise recipe_check_obs.yml is a known broken recipe but should we open a new issue to resolve the missing data issues with ESMValGroup/obs-maintainers?

5. Recipes that failed because the run took too long

We've increased the time on all of these except for recipe_ipccwg1ar6ch3_fig_3_42_a.yml which was already at the maximum time. Is there anything we can do about this?

We also had to increase time on these from the "Recipes that failed of other reasons or are still running" section.

6. Recipes that failed because model data couldn't be downloaded

7. Recipes that failed because of an HDF5 error

These three are all the same as in the v2.10 recipe test results.

This is a new entry.

8. Recipes that fail because of - we think! - an ESMValCore issue

ValueError: Chunks and shape must be of the same length/dimension. Got chunks=(), shape=(1,)

valeriupredoi commented 1 month ago

great summary and work @chrisbillowsMO and @ehogan :beer:

Here is the issue with those three HDF5-related failures, as posted by @bouweandela back in December last year, when they were working on the 2.10 release: https://github.com/ESMValGroup/ESMValTool/issues/3463#issuecomment-1857587917

This is an HDF5 thread-safety issue. It is flaky, but it appears to be mostly reproducible (positive flakiness, or was it negative? doesn't matter). This has to be fixed, most probably by adding a file lock() statement somewhere; I'll have a look myself, but IMO it shouldn't be set as a roadblock for the release.
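
As a rough illustration of the "file lock" idea, the sketch below serialises every read behind one shared lock, so that two threads never enter the HDF5 layer at the same time. It is only a minimal sketch: unsafe_read is a hypothetical stand-in for whatever call actually opens the file, and HDF5_LOCK, locked_read and read_all are names made up for this example, not ESMValCore functions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One process-wide lock: only one thread at a time may touch the HDF5 layer.
HDF5_LOCK = threading.Lock()


def unsafe_read(path):
    """Hypothetical stand-in for a reader built on a non-thread-safe library."""
    return f"data from {path}"


def locked_read(path):
    """Serialise calls to the unsafe reader behind the shared lock."""
    with HDF5_LOCK:
        return unsafe_read(path)


def read_all(paths, max_workers=4):
    """Read several files from a thread pool; the reads never overlap."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input paths.
        return list(pool.map(locked_read, paths))
```

The trade-off is that file access becomes sequential, so this only helps where correctness matters more than parallel I/O throughput.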

bouweandela commented 1 month ago

This Julia recipe has the following error:

recipe_rainfarm.yml

ERROR: LoadError: ArgumentError: Package YAML [ddb6d928-2868-570f-bddf-ab3f9cf99eb6] is required but does not seem to be installed:

Did you install the Julia dependencies?

valeriupredoi commented 1 month ago

fairly sure no is the answer to that q, bud :grin:

ehogan commented 1 month ago

This Julia recipe has the following error: recipe_rainfarm.yml ERROR: LoadError: ArgumentError: Package YAML [ddb6d928-2868-570f-bddf-ab3f9cf99eb6] is required but does not seem to be installed:

Did you install the Julia dependencies?

No, I had missed the esmvaltool install Julia step. Both Julia recipes now succeed, so I will update the first and second comments to reflect this 👍

schlunma commented 1 month ago

10. Recipes that never ran

* recipe_schlund20jgr_gpp_abs_rcp85.yml

* recipe_schlund20jgr_gpp_change_1pct.yml

* recipe_schlund20jgr_gpp_change_rcp85.yml

These have been excluded from the generate.py script. @schlunma might you need to run these?

Successfully tested them 👍 I'll update the comment above to reflect this.

ehogan commented 1 month ago

5. Recipes that failed because the run took too long

  • recipe_climate_change_hotspot.yml
  • recipe_eyring06jgr.yml
  • recipe_eyring13jgr_12.yml
  • recipe_ipccwg1ar6ch3_fig_3_19.yml
  • recipe_ipccwg1ar6ch3_fig_3_42_a.yml
  • recipe_ipccwg1ar6ch3_fig_3_42_b.yml
  • recipe_lauer22jclim_fig5_lifrac.yml

We've increased the time on all of these except for recipe_ipccwg1ar6ch3_fig_3_42_a.yml which was already at the maximum time. Is there anything we can do about this?

  • recipe_carvalhais14nat.yml
  • recipe_lauer22jclim_fig9-11ab_scatter.yml

We also had to increase time on these from the "Recipes that failed of other reasons or are still running" section.

The following recipes are now running successfully, so I will update the comments above:

Should I update the time for these recipes in SPECIAL_RECIPES in generate.py?

What should we do with the recipes that don't run within 8 hours?

ehogan commented 1 month ago

6. Recipes that failed because they used too much memory

  • recipe_model_evaluation_basics.yml

We've increased the memory on this one.

The following recipe is now running successfully, so I will update the comments above:

2024-05-16 09:28:34,122 UTC [86954] INFO    Time for running the recipe was: 0:01:42.672771
2024-05-16 09:28:34,977 UTC [86954] INFO    Maximum memory used (estimate): 73.2 GB
[...]
2024-05-16 09:28:35,092 UTC [86954] INFO    Run was successful

This is a new recipe since ESMValTool v2.10.0, so it will need adding to SPECIAL_RECIPES in generate.py.
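
A small sketch of how the memory estimate in the log above could be turned into a resource request. The log line format is taken from the output quoted above; everything else is an assumption for illustration: the 25% headroom, the helper names, and the shape of the returned entry (a dict holding an #SBATCH line). The real structure of SPECIAL_RECIPES in generate.py should be checked before copying anything from here.

```python
import math
import re

# Log lines as they appear in the run output quoted above.
LOG_LINES = [
    "2024-05-16 09:28:34,122 UTC [86954] INFO    "
    "Time for running the recipe was: 0:01:42.672771",
    "2024-05-16 09:28:34,977 UTC [86954] INFO    "
    "Maximum memory used (estimate): 73.2 GB",
    "2024-05-16 09:28:35,092 UTC [86954] INFO    Run was successful",
]

MEMORY_PATTERN = re.compile(r"Maximum memory used \(estimate\): ([0-9.]+) GB")


def max_memory_gb(lines):
    """Return the reported memory estimate in GB, or None if it is absent."""
    for line in lines:
        match = MEMORY_PATTERN.search(line)
        if match:
            return float(match.group(1))
    return None


def special_recipe_entry(memory_gb, headroom=1.25):
    """Build a hypothetical entry requesting the estimate plus some headroom."""
    requested_gb = math.ceil(memory_gb * headroom)
    # Assumed layout: one key holding the SLURM directive as a string.
    return {"memory": f"#SBATCH --mem={requested_gb}G"}
```

For the run above, max_memory_gb(LOG_LINES) gives 73.2, and with the assumed headroom the request is rounded up to 92G.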

ehogan commented 1 month ago

@bouweandela, @valeriupredoi, would it be possible to get some guidance on what to do now, please? How many of the failures above must we fix before moving onto the ESMValTool freeze and testing stages? Can all the diagnostic and data issues wait until ESMValTool testing? 🤔

valeriupredoi commented 1 month ago

Super work, guys! Here's me 3 cents (2 cents adjusted for inflation):

schlunma commented 1 month ago

A possible reason for some of these failures could be iris' new attribute handling: since version 3.8, iris now distinguishes between local and global attributes. We adopted this new behavior in https://github.com/ESMValGroup/ESMValCore/pull/2398.

This was the reason for the errors in recipe_schlund20esd.yml (fixed in https://github.com/ESMValGroup/ESMValTool/pull/3605) and recipe_wenzel16jclim.yml (fixed in https://github.com/ESMValGroup/ESMValTool/pull/3603).
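
For diagnostics that hit "TypeError: unhashable type: 'CubeAttrsDict'", one generic pattern (not necessarily what the PRs above do) is to convert the attributes into a hashable snapshot before using them as a dictionary key or set element. The sketch below uses a hypothetical AttrsDict stand-in rather than the real iris class, and assumes the attribute values themselves are hashable.

```python
class AttrsDict(dict):
    """Hypothetical stand-in for a dict-like attributes container."""


def hashable_key(attributes):
    """Return a hashable, order-independent snapshot of a mapping."""
    return tuple(sorted(attributes.items()))


def group_by_attributes(named_attributes):
    """Group names whose attribute mappings are equal.

    named_attributes is an iterable of (name, mapping) pairs.
    """
    groups = {}
    for name, attributes in named_attributes:
        # Using the mapping itself as a key would raise TypeError.
        groups.setdefault(hashable_key(attributes), []).append(name)
    return groups
```

Mutable mappings are unhashable by design, so the snapshot has to be taken again if the attributes change afterwards.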

ehogan commented 1 month ago

Super work, guys! Here's me 3 cents (2 cents adjusted for inflation):

Apologies @valeriupredoi, you did say this previously, and I promptly forgot! I will update the comment above appropriately 👍

valeriupredoi commented 1 month ago

Not a worry, Emma, release time is a very busy one 🙂

bouweandela commented 1 month ago

@bouweandela, @valeriupredoi, would it be possible to get some guidance on what to do now, please? How many of the failures above must we fix before moving onto the ESMValTool freeze and testing stages? Can all the diagnostic and data issues wait until ESMValTool testing? 🤔

If you suspect it is an ESMValCore issue, I would recommend fixing it before moving on to testing ESMValTool, but otherwise you should be fine to move on.

Should I update the time for these recipes in SPECIAL_RECIPES in generate.py?

Yes, that would be helpful for the next release manager.

What should we do with the recipes that don't run within 8 hours?

Are these recipes still running after 8 hours? In my experience, processes sometimes get killed without SLURM telling you. If there are no more log messages in the debug log or the diagnostic script logs long before the 8 hours are over, it seems likely that the process has silently crashed. If this is the case, you could try reducing the number of workers used by Dask. This can be done by configuring the distributed scheduler or, if there are non-lazy preprocessor functions (#674) in the recipe, by using the default scheduler: create a file called ~/.config/dask/dask.yml and put

num_workers: 16

in it. That will use just 16 threads instead of the default 128 on a default levante compute node, leaving 256GB/16 = 16GB of RAM per thread instead of just 2GB.
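
The arithmetic above can be checked with a short sketch. The node size (256 GB) and the thread counts (128 by default, 16 after the change) are the figures quoted in this comment; the function name is made up for the example.

```python
def memory_per_worker_gb(node_memory_gb, num_workers):
    """Return the RAM available to each worker if memory is shared evenly."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return node_memory_gb / num_workers


# Figures quoted above for a default levante compute node.
default_share = memory_per_worker_gb(256, 128)  # default thread count
reduced_share = memory_per_worker_gb(256, 16)   # with num_workers: 16
```

This treats memory as evenly divided between threads, which is only a rule of thumb: a single task can still exceed its share.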

ehogan commented 1 week ago

Closing this issue in favour of #2468 😊