Webpage: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.9.0rc1/debug.html
Successful: 135 out of 153
And here are the results of the comparison with the results for v2.8.0. 61 recipes have different results, which seems like a lot. I will have a preliminary look at them and report some results here later.
61 is not a lot - it's in fact bang-on, I think I had some 50-60 myself for 2.7 and Remi had a bucketload too; Matplotlib changed so expect plots differing only. Great work, bud :beers:
well, rainfarm died bc you didn't install Julia; where are @schlunma's schlund*gpp ones that failed?
carvalhais looks like a proper bug in iris - `cube.core_data()` when data is just a point, I'll have a closer look
Both lauer22 didn't actually exit - they seem to have been killed by SLURM mid-analysis, either node failure or memory-related? Same with mpqb and sea surface salinity - gonna update your list there
Great work with this recipe testing, @bouweandela!
> Both lauer22 didn't actually exit - they seem to have been killed by SLURM mid-analysis, either node failure or memory-related? Same with mpqb and sea surface salinity - gonna update your list there
Indeed, we have 7 recipes (last section of this https://github.com/ESMValGroup/ESMValTool/issues/3239#issuecomment-1602719758) that couldn't be run due to time/memory limitations, even though they could be run in the previous release. Possible reasons:
And 4-5 diagnostic failures that we should try to fix together next week 🍻
so @ledm has just reported to CEDA/JASMIN that a bunch of his runs get kicked out of SLURM on the basis of resources, but with nothing out of the ordinary going on (i.e. memory consumption of order 2GB) - he still has to confirm that he's using the latest Core, but the symptoms look identical to these recipes here that didn't exit, but rather were kicked out by the scheduler
> carvalhais looks like a proper bug in iris - `cube.core_data()` when data is just a point, I'll have a closer look
lemme look into this now :portugal:
OK @ledm is running with the latest:

```
esmvalcore    2.9.0.dev28+g179312e41    pypi_0    pypi
esmvaltool    2.9.0.dev32+ge17d1c8f6    pypi_0    pypi
```

and is seeing exactly what Bouwe saw for lauer22 etc - bizarre kickouts
Okay, glad it's not just me. Though I think that some of my recent issues originated with https://github.com/ESMValGroup/ESMValTool/issues/3240 and https://github.com/ESMValGroup/ESMValCore/pull/2110
The recipe that gets killed immediately every time on JASMIN for me is: https://github.com/ESMValGroup/ESMValTool/blob/09c18280751323354b4f0aa1f0ef6e260589326d/esmvaltool/recipes/ai_mpa/recipe_ocean_ai_mpa_po4.yml
@bouweandela carvalhais goes through fine with iris=3.6.0 (deployed) and 3.6.x (pip-installed) - that's either a fluke or something related to distributed (which I did not run with); same dask as yours too; try rerunning the recipe, methinks
I'm still struggling with this. I've changed the preprocessing stage of the recipe (removing the custom order) and I've tried a few older versions of esmvalcore (2.8.0, 2.8.1, and 2.8.1rc1). No matter what, I'm still getting the "killed" message after 0.3GB of RAM usage.
@ledm Would it be OK to open a separate issue for your recipe? This issue is about testing recipes for the upcoming v2.9 release. Re the recipes listed above that are running out of memory: this is probably due to scitools/iris#5338.
Okay, I'm able to recreate the error! It looks like this may be something else? It seems like it's an extract_levels preprocessor bug.
Running interactively on JASMIN:

```python
import iris
from esmvalcore.preprocessor._regrid import extract_levels

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])
```
This results in Killed; I'm 3 for 3 now in esmvalcore v2.8.1, so there's that! I'll try other schemes and see if they are more resilient.

Let me know if this needs to go into its own issue page!

Hope you all have lots of fun this week in Sweden!
Edit: This is killed for all four schemes and different levels:

```python
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])              # killed
c2 = extract_levels(cube, scheme='nearest', levels=[0.5])              # killed
c2 = extract_levels(cube, scheme='linear', levels=[0.5])               # killed
c2 = extract_levels(cube, scheme='nearest_extrapolate', levels=[0.5])  # killed
c2 = extract_levels(cube, scheme='linear_extrapolate', levels=[0.5])   # killed
```
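One way to narrow this down further (a sketch, assuming the node has enough memory to realize the cube at all) would be to force realization before interpolating, so that a kill afterwards points at the interpolation itself:

```python
# Sketch: realize the lazy data first. If this succeeds but the next call
# is still killed, the problem is in the vertical interpolation itself
# rather than in reading/realizing the input data.
_ = cube.data  # touching .data realizes the Dask array in memory
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])
```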
Edit2: The error occurs here: https://github.com/ESMValGroup/ESMValCore/blob/1101d36e3f343ec823842ea7c3f4b941ee942a89/esmvalcore/preprocessor/_regrid.py#L870
```python
# Now perform the actual vertical interpolation.
new_data = stratify.interpolate(levels,
                                src_levels_broadcast,
                                cube.core_data(),
                                axis=z_axis,
                                interpolation=interpolation,
                                extrapolation=extrapolation)
```
I'm running stratify version '0.3.0'. Is that correct? Who maintains stratify?
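For what it's worth, a minimal standalone call to stratify on tiny in-memory arrays (the shapes below are made up, nothing from the recipe) might help separate a stratify bug from an ESMValCore one:

```python
import numpy as np
import stratify

# Tiny made-up (z, y, x) field with source levels broadcast to its shape.
z_src = np.linspace(0.0, 100.0, 11)
data = np.random.rand(11, 4, 5)
z_src_bcast = np.ones(data.shape) * z_src[:, None, None]

# Interpolate to a single target level along the z axis (axis=0).
out = stratify.interpolate(np.array([0.5]), z_src_bcast, data, axis=0)
print(out.shape)  # expected (1, 4, 5); a crash here would implicate stratify
```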
Looks like stratify is one of @bjlittle's projects? Any idea what could be causing this bug, Bill?
Ooof, turns out stratify is a mix of C and Python, and I'm not capable of debugging that! Think my backup plan is to go back to this solution.
Let me know if this needs to go into its own issue page!
Yes, please
I ran the comparison tool again because there was an issue with it. NaN values did not compare as equal in arrays, so the tool flagged too many recipe runs as changed. The new comparison results are available here. Now the number of recipe runs that are different from v2.8.0 is 61. I will add a more detailed comparison once it is ready.
Detailed comparison, including all kinds of information on what exactly changed in the output is available here. Please scroll through the document to find your recipe. The indentation indicates what recipe/variable/coordinate the information applies to.
> @bouweandela carvalhais goes through fine with iris=3.6.0 (deployed) and 3.6.x (pip-installed) - that's either a fluke or something related to distributed (which I did not run with); same dask as yours too; try rerunning the recipe, methinks
@bouweandela have you rerun this recipe to have it finish OK, by any chance?
Just did with iris 3.6.1 and I'm still getting the same error.
By increasing the resources (from interactive to compute partition, or from compute to compute 512G), I was able to run successfully a few more recipes: `recipe_climate_change_hotspot`, `recipe_mpqb_xch4`, `recipe_sea_surface_salinity`, `recipe_thermodyn_diagtool`. But these recipes have longer runtimes and memory consumption than found during the v2.8 release (factor >2). I could run these recipes successfully before and after the latest iris PR was merged without any differences in terms of performance.
> By increasing the resources (from interactive to compute partition, or from compute to compute 512G), I was able to run successfully a few more recipes: `recipe_climate_change_hotspot`, `recipe_mpqb_xch4`, `recipe_sea_surface_salinity`, `recipe_thermodyn_diagtool`. [...]
I re-ran some of these recipes with a distributed scheduler using the following `.esmvaltool/dask.yml` file:
```yaml
cluster:
  type: distributed.LocalCluster
  n_workers: 8
  threads_per_worker: 4
  memory_limit: 32 GiB
```
(no idea how optimal those settings are).
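For reference, a rough Python equivalent of what that configuration requests (a sketch; the exact way ESMValCore instantiates the cluster may differ):

```python
from distributed import Client, LocalCluster

# 8 worker processes x 4 threads each = 32 threads, 32 GiB memory per worker.
cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=4,
    memory_limit="32 GiB",
)
client = Client(cluster)  # computations now run on this cluster
```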
- `recipe_thermodyn_diagtool`: ran with pretty much the exact same resources as during the v2.8.0 testing (1:30:00 hr, ~30 GiB memory). Note that >90% of the run time of that recipe is taken up by the diagnostic, so I wouldn't expect a lot of improvement with the distributed scheduler (assuming the diagnostic is not optimized).
- `recipe_climate_change_hotspot`: failed, see below.
- `recipe_mpqb_xch4`: failed, see below.
- `recipe_sea_surface_salinity`: failed, see below.
- `recipe_daily_era5.yml`: ran successfully in 18:00 min with 161 GiB of memory, but had a lot of warnings about `gc.collect() took 1.933s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.` and `Worker process still alive after 3.1999995422363288 seconds, killing`. This is certainly not optimal (for v2.8.0, the recipe ran in 12 min and 31 GiB memory usage), but at least it runs. I am pretty sure this could be optimized with smarter Dask settings.

The runs that failed exited with errors like `concurrent.futures._base.CancelledError: compute_and_return_warnings-3ee342d0-affa-4e69-9f2a-09fc9e7cbe92` during the `save` step. I also see a lot of warnings like `Sending large graph of size ...` or `full garbage collections took XX% CPU time recently` in these runs, so my guess is that some of the preprocessors used in these recipes are not fully lazy yet.
Bottom line: I think for the fully lazy preprocessors we can achieve a similar (hopefully much better with smarter configuration) run time and memory usage when using Dask Distributed, but the non-lazy preprocessors crash with it.
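A quick way to spot a non-lazy preprocessor step, should anyone want to hunt these down (a sketch; `some_preprocessor_step` is a placeholder, not a real function name):

```python
import iris

cube = iris.load_cube("some_input.nc")  # hypothetical input file
assert cube.has_lazy_data()             # iris loads lazily by default

result = some_preprocessor_step(cube)   # placeholder for the step under test
# If this prints False, the step realized the data, i.e. it is not fully
# lazy, which fits the large-graph warnings and crashes described above.
print(result.has_lazy_data())
```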
I tested `recipe_daily_era5.yml` with both iris 3.4 and 3.6.1 without configuring Dask Distributed: it runs fine with iris 3.4 and runs out of memory with 3.6.1.
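If the regression comes from the iris loading changes, comparing the chunking each iris version produces for the same file could confirm it (a sketch; the file name is hypothetical):

```python
import iris

# Run once per environment (iris 3.4 vs 3.6.1) and compare the output.
cube = iris.load_cube("era5_input.nc")      # hypothetical input file
lazy = cube.lazy_data()
print(lazy.chunksize)                       # shape of a single chunk
print(lazy.nbytes / 2**30, "GiB in total")  # total (lazy) array size
```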
carvalhais goes through with dask=2023.3.0 but fails a couple lines down the diag at plotting:
```
Traceback (most recent call last):
  File "/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py", line 819, in <module>
    main(config)
  File "/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py", line 757, in main
    _plot_single_map(plot_path_mod, tau_ctotal,
  File "/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py", line 654, in _plot_single_map
    plt.imshow(_get_data_to_plot(_dat.data),
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/matplotlib/pyplot.py", line 2695, in imshow
    __ret = gca().imshow(
            ^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/mpl/geoaxes.py", line 318, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/mpl/geoaxes.py", line 1331, in imshow
    img, extent = warp_array(img,
                  ^^^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/img_transform.py", line 192, in warp_array
    array = regrid(array, source_native_xy[0], source_native_xy[1],
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/img_transform.py", line 278, in regrid
    _, indices = kdtree.query(target_xyz, k=1)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "_ckdtree.pyx", line 795, in scipy.spatial._ckdtree.cKDTree.query
ValueError: 'x' must be finite, check for nan or inf values
```
so cartopy is breaking it now
the problem with downgrading dask is that we actually have to isolate a bug at their end, but beats me if I could do that; what I did was save the scalar cube that was causing the issue and try to repeat the operation that dask didn't like:
```python
import iris


def make_cube():
    """Make a scalar cube and save it."""
    cube = iris.load_cube("tauxxx.nc")
    return cube


def main():
    """Run the show."""
    cube = make_cube()
    cube.data = cube.lazy_data()
    print(cube.has_lazy_data())
    print(cube)
    x = float(cube.core_data())
    print(x)


if __name__ == "__main__":
    main()
```
problem is, that works well, no issue. Note that I needed to force the lazy data bc the cube in the diag does indeed have lazy data, whereas if you load an already scalar cube it has realized data. Am stumped :confounded:
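Another angle that might be worth a try (a sketch): do the same conversion directly on a 0-d Dask array, bypassing iris entirely, to see whether plain dask reproduces the problem:

```python
import dask.array as da
import numpy as np

# A 0-d lazy array, mimicking the scalar cube's core data.
arr = da.from_array(np.array(42.0))
print(float(arr))  # float() triggers an implicit compute of the 0-d array
```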
> carvalhais goes through with dask=2023.3.0 but fails a couple lines down the diag at plotting: [...] `ValueError: 'x' must be finite, check for nan or inf values` [...] so cartopy is breaking it now
this is signaled and raised here https://github.com/SciTools/cartopy/issues/2199 and it looks like they already have a PR to fix it :partying_face:
Summary of relevant differences:
Recipes where the plots are identical, but (some) data differs
For the above recipes, I suspect we might want to make the check whether the data is the same more tolerant, since the plots are considered identical enough.
The recipes below need a check by the @ESMValGroup/esmvaltool-recipe-maintainers (please let us know if you are no longer interested in being a maintainer, e.g. by commenting in this issue):
Recipes where (some) plots differ, but not the data
(Some) plots and data are different
The results from running the recipes with v2.8.0 of the tool are available here and the results from running them with v2.9.0 of the tool are available here.
You can see very specifically what is different in the comparison tool output files: detailed comparison part 1 and detailed comparison part 2.
@ESMValGroup/technical-lead-development-team Because we're seeing a considerable performance regression for some recipes when using the basic/default Dask scheduler, I would propose that we make the release with iris pinned to v3.4 or greater instead of v3.6 or later. That would allow users who are really keen on using the basic scheduler to do so. Please let me know if there are any objections to that.
Out of interest, what was the last ESMValTool version that uses iris 3.4?
> Out of interest, what was the last ESMValTool version that uses iris 3.4?
For the release of v2.8.0, we used iris-3.4.1 to test all recipes
@bouweandela thanks for the list! The plot from recipe_spei.yml only changed the colour (because R chooses different ones every time, I should set fixed ones there but I guess there is not enough time to change that before the release now). For recipe_flato13ipcc_figures_938_941_cmip3.yml, recipe_flato13ipcc_figures_938_941_cmip6.yml and recipe_weigel21gmd_figures_13_16.yml there is a somewhat different setting of the fx files now (see #3215), which explains the changed values.
Thanks for checking @katjaweigel! I was planning to make the release on Thursday next week, so if you still want to fix the plotting in recipe_spei.yml before that time you can do it.
Regarding the list of "Recipes that failed because they used too much memory" (in this https://github.com/ESMValGroup/ESMValTool/issues/3239#issuecomment-1602719758), I was able to run them successfully on Levante, except one, by increasing the computational resources requested (see #3216). This does not really tell us where the problems come from but will nevertheless help the RM, who gets fewer failures.
The biggest increase of resources required is for the 2 cloud recipes. Using 1024 GB compute nodes and `max_parallel_tasks=3`, I was able to run `recipe_lauer22jclim_fig5_lifrac` (`Time for running the recipe was: 5:20:57.188249`, `Maximum memory used (estimate): 881.6 GB`) but not `recipe_lauer22jclim_fig3-4_zonal` (ends with `1 tasks running, 1 tasks waiting for ancestors, 14/16 done` after the 8h limit). FYI, @axel-lauer.
We did another round of test runs. The results are available here: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.9.0/
Here is the environment.yml and here are the comparison results and the detailed comparison results. The comparison was done against the previous round of testing for this release. Because of disk space limitations, I had to remove the v2.9.0rc1 results from the webpage, but the results are still available in `/work/bd0854/b381141/esmvaltool_output/esmvaltool-v2.9.x-2023-06-20` for those who have access to Levante.
All recipes ran successfully except for:
@bouweandela here's your Portuguese issue, bud https://github.com/ESMValGroup/ESMValTool/issues/3281
I think we can safely close this now that we have #3287 - @bouweandela @remi-kazeroni since you put so much effort in it, I'd feel bad closing it myself, will you do the service pls :medal_military:
Hi, I've been away on vacation and didn't see the mention to check differing plots before now. To me it looks like only consecutive dry days have changed, and I wonder if the version of consecutive dry days has simply changed from the one that caps the count at the end of a year to the one that runs over multiple years when the longest dry period extends across calendar years. That would explain the plot discrepancy and the weird end of the original plot (in 2.8), and it would anyway be the more sensible thing to plot this way. I.e. I think this is an improvement, then.
This issue documents recipe testing for the v2.9.0 release
Installation and environment
Environment file: environment.txt

Config user file: all options at the default for Levante except `search_esgf: when_missing`.