ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

`recipe_collins13ipcc` fails due to memory issue with `extract_levels` preproc #3106

Closed: remi-kazeroni closed this issue 1 year ago

remi-kazeroni commented 1 year ago

Describe the bug
Since Levante is available, it has been possible to run this recipe on the fat compute nodes (512 GB), see the run with v2.7.0. During the recipe testing for v2.8, jobs fail with an OUT_OF_MEMORY error related to the diagnostic known to be the most memory-intensive: `IAV_calc_thetao`. I'm not sure whether this failure is telling us about an issue with the Core or whether the recipe needs fixing w.r.t. its usage of regridding. It would be great if someone has the time to investigate that a bit. To ease the work, I've made a shorter version of the recipe, using only 1 preproc and 1 diag, see below.


valeriupredoi commented 1 year ago

let me have a crack at it now, @remi-kazeroni :+1:

valeriupredoi commented 1 year ago

quick question - what resolution are those datasets in the recipe? We need to find out the optimal number - then we definitely want to regrid after we extract the levels

schlunma commented 1 year ago

I am also currently running some tests on this in the background. The shape of the problematic data is

<iris 'Cube' of sea_water_potential_temperature / (K) (time: 2880; depth: 40; latitude: 216; longitude: 360)>

This results in around 267 GiB of data, which all gets realized in stratify.interpolate in our code here, so you need a lot of memory :sweat_smile:

@bouweandela I saw that there's a merged PR in the stratify repository from you about making this function lazy, but it looks like this hasn't made it into a release yet. Did I get that right?

valeriupredoi commented 1 year ago

ay ay ay - BTW I was trying to dig out the olevel points from CMIP5 - I see the recipe is asking for exactly 40 depth points. What are the chances that those differ by only half a meter from the actual points the file comes with, and we are doing all that stratifying for a mere difference in values?

valeriupredoi commented 1 year ago

latitude: 216; longitude: 360 means almost 1x1degs but not quite - not much coarser though

valeriupredoi commented 1 year ago

tough luck - my job got killed out of memory:

#!/bin/bash -l 

#SBATCH --job-name=recipe_collins.%J
#SBATCH --output=/home/b/b382109/output_collins_short/recipe_collins.%J.out
#SBATCH --error=/home/b/b382109/output_collins_short/recipe_collins.%J.err
#SBATCH --account=bk1088
#SBATCH --partition=compute 
#SBATCH --time=08:00:00 
#SBATCH --constraint=512G 
#SBATCH --mail-user=valeriu.predoi@ncas.ac.uk 
#SBATCH --mail-type=FAIL,END 

set -eo pipefail 
unset PYTHONPATH 

. /home/b/b382109/miniconda3/etc/profile.d/conda.sh
conda activate release280rc1

esmvaltool run /home/b/b382109/recipe_collins_short.yml

err says:

/var/spool/slurmd/job4239266/slurm_script: line 19: 4009775 Killed                  esmvaltool run /home/b/b382109/recipe_collins_short.yml
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=4239266.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

valeriupredoi commented 1 year ago

gonna ask for MOAR MEMORY!

remi-kazeroni commented 1 year ago

> gonna ask for MOAR MEMORY!

I tried it already with `#SBATCH --constraint=1024G` (the max available at DKRZ) but got the same issue... It would be good to understand what has changed since v2.7 such that this recipe can't be run anymore.

valeriupredoi commented 1 year ago

indeed! Has the recipe changed in any way? That'd be the first low-hanging fruit

schlunma commented 1 year ago

> indeed! Has the recipe changed in any way? That'd be the first low-hanging fruit

No, last change on 8 March 2022.

Some results from my side: I tried running this with only a single model (HadGEM) and got a successful run with a max memory usage of 261 GiB (really close to what I expected).

I'm currently running the full testing recipe (all three models) with version 2.7 of the code using the 2.8 environment. If this is successful, we know that it's a problem on our side. If this fails, then it's a problem of some dependency. Stay tuned :radio:

valeriupredoi commented 1 year ago

that is proper black box testing @schlunma :medal_military:

bouweandela commented 1 year ago

> @bouweandela I saw that there's a merged PR in the stratify repository from you about making this function lazy, but it looks like this hasn't made it into a release yet. Did I get that right?

Yes, see https://github.com/SciTools/python-stratify/issues/54

bouweandela commented 1 year ago

I ran into a similar issue earlier this week, where a very simple recipe ran out of memory on my laptop with 16GB RAM. Creating a new environment solved it for me.

schlunma commented 1 year ago

The test with our v2.7 code and a fresh 2.8 environment ran successfully (maximum memory usage: 321.9 GB). I am currently running the recipe with our v2.8 code and a fresh 2.8 environment. If that fails, I guess we can safely assume that something in our code causes this. If this also runs successfully, I have no idea what's happening :shrug:
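
For reference, one way to check the peak memory of a finished SLURM job is shown below (a minimal sketch, assuming SLURM job accounting is available as it is at DKRZ; the job ID is just the one from the log further up and serves as a placeholder):

# Peak resident memory (MaxRSS) and final state of a completed job
sacct -j 4239266 --format=JobID,JobName%20,State,Elapsed,MaxRSS

# Condensed CPU/memory efficiency summary, if the seff contrib tool is installed
seff 4239266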

valeriupredoi commented 1 year ago

The only explanation I can think of for a fresh environment with a fixed Core version shrinking the required memory, that doesn't involve witchcraft/wizardry, is that some numerical dependency had a serious memory-related bug that was eventually fixed in a later/latest version. I am quite curious to see Manu's results 🍿

schlunma commented 1 year ago

All right, this second test with the current RC also ran fine for me...so it might actually be an issue with a dependency that has been updated recently.

This is my environment in case you want to compare: env280rc1.txt

valeriupredoi commented 1 year ago

ooh exciting! Is that the latest environment, Manu? - the one that ran the recipe, that is?

schlunma commented 1 year ago

> the one that ran the recipe, that is?

Yes!

valeriupredoi commented 1 year ago

cool! Could you maybe do a diff between the two envs at your end please, @schlunma - the one that didn't run the thing vs the one that did? You can get a nicely formatted yaml env list with `conda env export > myenv.yml`. I've just looked at my envs (an older one that didn't run the recipe vs a fresh one I just built) and there are just a few differences, and nothing jumps off the screen as a dependency that has changed and could have influenced the memory consumption - have a look: diff_envs.txt
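
In case it is useful, here is a small sketch of that comparison workflow (environment names and output file names are placeholders):

# Export both environments to yaml lists and diff them
conda env export -n env_that_failed > env_failed.yml
conda env export -n env_that_worked > env_worked.yml

# Only the differing packages remain in the diff
diff env_failed.yml env_worked.yml > diff_envs.txt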

schlunma commented 1 year ago

I don't have an env that didn't run the recipe, but this is the comparison of my environment (bottom, the one that worked) vs. the one of @remi-kazeroni (top, the one that didn't work). I cannot find anything obvious either...

< # packages in environment at /work/bd0854/b309192/soft/mambaforge/envs/tool_280rc1:
---
> # packages in environment at /work/bd0854/b309141/mambaforge/envs/esm2:
17c17
< astroid                   2.14.2          py310hff52083_0    conda-forge
---
> astroid                   2.15.0          py310hff52083_0    conda-forge
93,94c93,94
< esmvalcore                2.8.0rc2.dev4+g1da5904f2          pypi_0    pypi
< esmvaltool                2.8.0.dev95+gde8d1bdad          pypi_0    pypi
---
> esmvalcore                2.8.0rc1                 pypi_0    pypi
> esmvaltool                2.8.0.dev79+g407cfe5b4          pypi_0    pypi
117c117
< fonttools                 4.38.0          py310h5764c6d_1    conda-forge
---
> fonttools                 4.39.0          py310h1fa729e_0    conda-forge
123c123
< fsspec                    2023.1.0           pyhd8ed1ab_0    conda-forge
---
> fsspec                    2023.3.0           pyhd8ed1ab_1    conda-forge
257c257
< matplotlib-base           3.7.0           py310he60537e_0    conda-forge
---
> matplotlib-base           3.7.1           py310he60537e_0    conda-forge
272c272
< mypy                      1.0.1                    pypi_0    pypi
---
> mypy                      1.1.1                    pypi_0    pypi
317c317
< platformdirs              3.0.0              pyhd8ed1ab_0    conda-forge
---
> platformdirs              3.1.0              pyhd8ed1ab_0    conda-forge
348c348
< pylint                    2.16.3             pyhd8ed1ab_0    conda-forge
---
> pylint                    2.16.4             pyhd8ed1ab_0    conda-forge
361c361
< pytest                    7.2.1              pyhd8ed1ab_0    conda-forge
---
> pytest                    7.2.2              pyhd8ed1ab_0    conda-forge
492c492
< r-styler                  1.9.0             r41hc72bb7e_0    conda-forge
---
> r-styler                  1.9.1             r41hc72bb7e_0    conda-forge
498c498
< r-udunits2                0.13.2.1          r41h06615bd_1    conda-forge
---
> r-udunits2                0.13.2.1          r41h133d619_1    conda-forge
528c528
< setuptools                67.4.0             pyhd8ed1ab_0    conda-forge
---
> setuptools                67.5.1             pyhd8ed1ab_0    conda-forge
568c568
< tqdm                      4.64.1             pyhd8ed1ab_0    conda-forge
---
> tqdm                      4.65.0             pyhd8ed1ab_1    conda-forge

remi-kazeroni commented 1 year ago

I also don't have an explanation for what went wrong. Here is my "old" env file with which I could not run the recipe: env280rc1_rk.txt, and here is the fresh new env with which @schlunma could run the recipe: env280rc1_ms.txt

< name: new_env_ms
---
> name: old_env_rk
18c18
<   - astroid=2.15.0=py310hff52083_0
---
>   - astroid=2.14.2=py310hff52083_0
107c107
<   - fonttools=4.39.0=py310h1fa729e_0
---
>   - fonttools=4.38.0=py310h5764c6d_1
113c113
<   - fsspec=2023.3.0=pyhd8ed1ab_1
---
>   - fsspec=2023.1.0=pyhd8ed1ab_0
241c241
<   - matplotlib-base=3.7.1=py310he60537e_0
---
>   - matplotlib-base=3.7.0=py310he60537e_0
290c290
<   - platformdirs=3.1.0=pyhd8ed1ab_0
---
>   - platformdirs=3.0.0=pyhd8ed1ab_0
317c317
<   - pylint=2.16.4=pyhd8ed1ab_0
---
>   - pylint=2.16.3=pyhd8ed1ab_0
329c329
<   - pytest=7.2.2=pyhd8ed1ab_0
---
>   - pytest=7.2.1=pyhd8ed1ab_0
458c458
<   - r-styler=1.9.1=r41hc72bb7e_0
---
>   - r-styler=1.9.0=r41hc72bb7e_0
464c464
<   - r-udunits2=0.13.2.1=r41h133d619_1
---
>   - r-udunits2=0.13.2.1=r41h06615bd_1
494c494
<   - setuptools=67.5.1=pyhd8ed1ab_0
---
>   - setuptools=67.4.0=pyhd8ed1ab_0
530c530
<   - tqdm=4.65.0=pyhd8ed1ab_1
---
>   - tqdm=4.64.1=pyhd8ed1ab_0
597c597
<       - esmvaltool==2.8.0.dev79+g407cfe5b4
---
>       - esmvaltool==2.8.0.dev77+gd0c0c038e
609c609
<       - mypy==1.1.1
---
>       - mypy==1.0.1
636c636

I will create a fresh new env, run the recipe once more and close this issue if all works fine. This is the last pending thing on our way to v2.8.0!

valeriupredoi commented 1 year ago

cool! Cheers for the env listings, fellas - I honestly can't see anything that might affect memory usage. My runs with a fresh env died again out of memory, so I am even more puzzled now, but heck it - let's not shed more tears over this, it's a memory guzzler anyway and a long-term fix is needed - godspeed @remi-kazeroni with the last run of it, and let's get 2.8 to take the checkered flag :checkered_flag:

remi-kazeroni commented 1 year ago

My new test run also died out of memory. Earlier this week I had another test run which timed out. But @schlunma got a successful run. This could also hint at a problem on the DKRZ side. I know that they have recently changed some settings of the SLURM scheduler... @schlunma and I will both submit a couple more runs to see if we can get a systematic error, and we will report on this tomorrow morning. If the issue is on the DKRZ side, we could probably close this and move forward with the release.

schlunma commented 1 year ago

I re-ran this recipe 4 more times and got 4 successful runs. I have no idea what's causing this weird behavior, but I suggest we continue with our release and close this issue. I don't think anything in our code causes this :+1:

remi-kazeroni commented 1 year ago

Thanks a lot for all the testing @schlunma! v2.8.0rc2 is now on the launchpad 🚀

remi-kazeroni commented 1 year ago

Because this surfaced again during the second round of testing (see #3127), @schlunma and I investigated a bit more. It turns out the entire memory of the fat compute nodes of Levante (requested via `--constraint=512G` or `--constraint=1024G`) is not made available to the job unless one specifies `--mem=0` in the batch script. @schlunma had that option set by default and could run the recipe successfully, but I did not... Default DKRZ SLURM settings were changed recently.

Asked DKRZ about this:

> we have made some changes to memory recently. Namely, the default memory per CPU has been reduced from 960 MB to 940 MB. When using the larger nodes (512G and 1024G), you need to specify --mem=0 to get the full memory. For the 256G nodes, --mem=0 is not necessary. For the billing, using --mem=0 does not make a difference.

That is why `recipe_collins13ipcc` was running fine on Levante with previous ESMValTool versions but not during the v2.8 testing.

Conclusion: just use `--mem=0` with the compute partition of Levante.
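
For future reference, the relevant batch script lines would then look something like this (a sketch only; account, output paths and recipe as in the script further up):

#SBATCH --partition=compute
#SBATCH --constraint=512G
# release the full memory of the fat node to the job
#SBATCH --mem=0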

valeriupredoi commented 1 year ago

a discovery akin to the discovery of toast :bread: I'll put this in the documentation, then :+1:

HGWright commented 1 year ago

@bouweandela @valeriupredoi @schlunma @remi-kazeroni Version 0.3.0 of Python-Stratify is now released and available on PyPI and conda-forge. Please install it, use it, and be sure to let us know if any issues come up.
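
For anyone wanting to try it out in an existing environment, a minimal sketch (assuming the conda-forge package name python-stratify and that mamba is available):

# Update python-stratify to the 0.3.0 release in the active environment
mamba install -c conda-forge "python-stratify>=0.3.0"

# Confirm the installed version
mamba list python-stratify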