ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

Tasks getting killed on Jasmin due to stratify being called from esmvalcore.preprocessor._regrid.extract_levels() preprocessor #3244

Closed: ledm closed this issue 2 months ago

ledm commented 1 year ago

On jasmin, jobs are being killed when the following code runs:

import iris
from esmvalcore.preprocessor._regrid import extract_levels

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"

cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])

This occurs with several versions of esmvalcore (2.8.0, 2.8.1, 2.9.0).

The error occurs for all four schemes and a range of level values (0.0, 0.1, 0.5):

c2 = extract_levels(cube, scheme='nearest', levels=[0.1])              # killed
c2 = extract_levels(cube, scheme='nearest', levels=[0.5])              # killed
c2 = extract_levels(cube, scheme='linear', levels=[0.5])               # killed
c2 = extract_levels(cube, scheme='nearest_extrapolate', levels=[0.5])  # killed
c2 = extract_levels(cube, scheme='linear_extrapolate', levels=[0.5])   # killed

In all cases, the error occurs here: https://github.com/ESMValGroup/ESMValCore/blob/1101d36e3f343ec823842ea7c3f4b941ee942a89/esmvalcore/preprocessor/_regrid.py#L870

    # Now perform the actual vertical interpolation.
    new_data = stratify.interpolate(levels,
                                    src_levels_broadcast,
                                    cube.core_data(),
                                    axis=z_axis,
                                    interpolation=interpolation,
                                    extrapolation=extrapolation)

Stratify (version 0.3.0) is a C/Python interface wrapper and it has caused trouble before. It is not lazy, so it may try to load 120 GB files into memory, among other issues. My previous solution to this problem was to write my own preprocessor:

https://github.com/ESMValGroup/ESMValCore/issues/1039 https://github.com/ESMValGroup/ESMValCore/pull/1048

Which has been abandoned, but I'm tempted to bring it back. (The deadline for this piece of work is 24th July!)

This is an extension of the discussion here: https://github.com/ESMValGroup/ESMValTool/issues/3239

bouweandela commented 1 year ago

stratify has been lazy since v0.3.0, and the extract_levels preprocessor is lazy in the ESMValCore development branch and in the release candidate ESMValCore v2.9.0rc1. The iris function broadcast_to_shape is now lazy too (https://github.com/SciTools/iris/pull/5359), but it is not yet in a released version of iris.

You could try installing iris from source (clone the repository and run pip install .) or wait for the upcoming iris 3.6.1 release.

ledm commented 1 year ago

Okay, that's not the problem either!

Just loading the data is enough for it to get killed!

import iris
# from esmvalcore.preprocessor._regrid import extract_levels
from esmvalcore.preprocessor._volume import extract_surface

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
print('load cube:', fn)
cube = iris.load_cube(fn)

print(cube)
print(cube.data[:,0,:,:])

Also results in Killed. So it's not @bjlittle's stratify that's at fault.

Just loading this data file breaks.

bouweandela commented 1 year ago

That's because you're trying to load all the data into memory, maybe it doesn't fit?

Try something like

import iris

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
print('load cube:', fn)
cube = iris.load_cube(fn)

print(cube)
print(cube.core_data()[:, 0, :, :])

bouweandela commented 1 year ago

See also https://github.com/ESMValGroup/ESMValCore/issues/2114

ledm commented 1 year ago

This works, thanks! ... but this returns a dask array, which is not what I want. I just want to extract the surface layer of a cube and get a cube back (4D -> 3D, or 3D -> 2D). extract_layer is unable to do that for these files either!
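For what it's worth, extracting the surface layer while keeping the container type is just indexing along the depth axis; with iris, `cube[:, 0]` returns a cube rather than a bare array and, as far as I understand, stays lazy until `.data` is accessed. A plain-numpy sketch of the shape reduction (the array here stands in for a cube):

```python
import numpy as np

# Plain-numpy stand-in for a (time, depth, lat, lon) cube; with an iris cube
# the same indexing returns a cube and keeps lazy data lazy.
data = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)

surface = data[:, 0]  # take depth index 0: 4D -> 3D
print(surface.shape)  # (2, 4, 5)
```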

ledm commented 1 year ago

Also, I should say that I've tried moving the preprocessor order around and I had the same problem with regrid as well. I think that likely also realises the data, @bouweandela.

valeriupredoi commented 1 year ago

iris=3.6.1 is now available on conda-forge and gets pulled into our environment, so if you can, try regenerating the env and using it to see if that fixes your issue @ledm :beer:

ledm commented 1 year ago

Just to confirm my email @valeriupredoi, updating to iris=3.6.1 does not solve this issue.

Method:

mamba install iris=3.6.1

then in the ESMValCore directory:

pip install --editable '.[develop]'

Then in an interactive Python session:

>>> import iris
>>> iris.__version__
'3.6.1'
>>> import esmvalcore
>>> esmvalcore.__version__
'2.9.0.dev0+gb12682d2a.d20230627'
>>> import stratify
>>> stratify.__version__
'0.3.0'
ledm commented 1 year ago

Okay, so more investigation: watching top while running the script at the start of this issue shows a huge spike in memory usage. The file itself is only 2 GB, but I've seen up to 8 GB in top. Memory being several times larger than the file suggests a memory issue in iris/stratify.

This is probably why re-ordering the preprocessors failed me earlier. I had assumed that extracting a smaller region first, then the surface layer, would mean less memory was needed (it didn't work!). With a memory leak it doesn't really matter how small a region you extract, as it will leak and break anyway.
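A quick way to quantify the spike from inside the script, rather than watching top, is the stdlib resource module (Unix-only); a sketch:

```python
import resource

# Peak resident set size of this process so far; on Linux ru_maxrss is in KiB.
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kib / 1024:.1f} MiB")
```

Printing this before and after a suspect preprocessor call shows exactly which step blows the memory up.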

valeriupredoi commented 1 year ago

@ledm here's what I found out: the script you gave me, i.e.

import iris
from esmvalcore.preprocessor._regrid import extract_levels

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"

cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])

needs 13 GB of resident memory (RES) to run to completion; this with:

esmvalcore                2.9.0rc1           pyh39db41b_0    conda-forge/label/esmvalcore_rc
esmvaltool                2.9.0.dev41+gda7f3dbe6          pypi_0    pypi
iris                      3.6.1              pyha770c72_0    conda-forge
python-stratify           0.3.0           py311h1f0f07a_0    conda-forge

and the file in question is indeed 2 GB, but remember that's a compressed netCDF4 file, usually with a ~40% compression factor. That means extract_levels loads the entire data into memory roughly three times over. @bouweandela said extract_levels is now lazy, but it's very clear how not lazy it actually is; why the footprint is so bad, i.e. about 3x larger than the in-memory size of the file, is beyond me. Sorry I misinterpreted, thinking new iris would solve this; obviously not. But the question is: why is sci3 killing your job when it only needs 13 GB of memory? Unless that job was different, I see no reason why. Now, I believe stratify is lazy these days, so we can go ahead and make extract_levels lazy; in fact we should do that, but in the meantime, try running on a node that may not kick you out :grin:
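The back-of-envelope arithmetic behind that, as a rough sketch (interpreting "40% compression factor" as the file being ~40% of its uncompressed size, which is my assumption):

```python
# Back-of-envelope memory arithmetic; numbers and interpretation approximate.
compressed_gib = 2.0
uncompressed_gib = compressed_gib / 0.4   # ~5 GiB once decompressed in memory
copies = 13.0 / uncompressed_gib          # peak RES divided by one full copy
print(f"~{uncompressed_gib:.0f} GiB uncompressed, ~{copies:.1f} full copies held")
# ~5 GiB uncompressed, ~2.6 full copies held
```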

valeriupredoi commented 1 year ago

The source of this problem is vinterp (old name) or stratify.interpolate() (new name) becoming completely realized/computed/not lazy due to levels and src_levels_broadcast being <class 'numpy.ndarray'> - this is exactly @bouweandela's issue https://github.com/ESMValGroup/ESMValCore/issues/2114. Just to confirm: the data in the example above is indeed <class 'dask.array.core.Array'>, so making the coords lazy should be easy.
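To illustrate the point (a hypothetical sketch, not ESMValCore's actual fix, and `lazy_nearest_level` is a made-up helper name): as long as the reduction is expressed through dask, nothing is realized until `.compute()`; handing stratify plain numpy inputs forces everything into memory at once.

```python
import numpy as np
import dask.array as da

def lazy_nearest_level(data, src_levels, target):
    """Nearest-level extraction along axis 0, evaluated lazily block-by-block.

    Assumes the level axis sits in a single chunk (rechunk first if not).
    """
    def nearest(block):
        idx = int(np.abs(src_levels - target).argmin())
        return block[idx:idx + 1]  # keep the axis so block shapes stay predictable

    return data.map_blocks(nearest, chunks=((1,),) + data.chunks[1:])

src_levels = np.array([0.5, 10.0, 100.0])            # model level depths (m)
data = da.arange(3 * 4 * 5).reshape(3, 4, 5).rechunk((3, 2, 5))
surface = lazy_nearest_level(data, src_levels, 0.1)  # still lazy at this point
print(surface.compute().shape)  # (1, 4, 5)
```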

ledm commented 1 year ago

try running on a node that may not kick you out

Lol, if only it were that easy. This gets killed for me on sci1, sci3, sci4, sci6, and the LOTUS high-mem queue!

valeriupredoi commented 1 year ago

sci2 did the trick for me. We now know where the problem lies, so fixing should follow 😁

ledm commented 1 year ago

Okay - running my original recipe (lol not fried chicken!) on sci2 now. Don't know if this is useful information, but it's trying to download 20GB of data from ESGF now. Not sure why it never got there before on sci1. (sci3 isn't connected to ESGF, I don't think)

ledm commented 1 year ago

Okay, so I reverted to ESMValTool 2.8 and iris 3.4. I'm still running out of memory, but at least it's failing properly, with a traceback, instead of just getting killed:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 8.76 GiB for an array with shape (1176120000,) and data type float64

Calling this a big W.
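As a sanity check, the size in the message is exactly what that shape implies:

```python
# The traceback asks for 1,176,120,000 float64 elements; at 8 bytes each that
# is the 8.76 GiB numpy reports.
n = 1_176_120_000
gib = n * 8 / 2**30
print(f"{gib:.2f} GiB")  # 8.76 GiB
```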

ledm commented 1 year ago

Correction: this was on sci2. On sci3, it just got killed the normal way. No idea what's going on. Starting to think it's a JASMIN thing. Will try sci6 next.

ledm commented 1 year ago

Continuing with this, here's a minimal testing recipe.

https://github.com/ESMValGroup/ESMValTool/blob/AscensionIslandMarineProtectedArea/esmvaltool/recipes/ai_mpa/recipe_ocean_ai_mpa_o2_testing.yml

On JASMIN.sci1 this fails for me. If I comment out either recipe, it runs fine.

The fact that it works with one dataset but fails with two makes me think that perhaps something isn't being properly closed after it finishes? Or it's trying to run two things at once, even with max_parallel_tasks: 1 in my config-user file.

bouweandela commented 6 months ago

The issue mentioned in the top post has been solved in https://github.com/ESMValGroup/ESMValCore/pull/2120 which will be available in the upcoming v2.11.0 release of ESMValCore.

I also investigated the recipe in https://github.com/ESMValGroup/ESMValTool/issues/3244#issuecomment-1623305615:

bouweandela commented 6 months ago

Continuing with this, here's a minimal testing recipe.

https://github.com/ESMValGroup/ESMValTool/blob/AscensionIslandMarineProtectedArea/esmvaltool/recipes/ai_mpa/recipe_ocean_ai_mpa_o2_testing.yml

@ledm The recipe now runs with the ESMValCore main branch (and soon to be released v2.11.0). Even though regridding is not lazy, this isn't such a problem as the data has already been reduced in size a lot by computing the climate statistics and vertical level extraction before regridding.

bouweandela commented 2 months ago

With https://github.com/ESMValGroup/ESMValCore/pull/2457 merged, regridding is now automatically lazy for data with 2D lat/lon coordinates as well.