stratify is lazy since v0.3.0, and the extract_levels preprocessor is lazy in the ESMValCore development branch and in the release candidate ESMValCore v2.9.0rc1. The iris function broadcast_to_shape is now lazy (https://github.com/SciTools/iris/pull/5359), but it is not yet in a released version of iris. You could try installing iris from source (clone the repository and run pip install .) or wait for the upcoming iris 3.6.1 release.
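For reference, a minimal sketch (the file name is just a placeholder) of how to check whether a given step has kept a cube's data lazy:
import iris

# Hypothetical check: has_lazy_data() reports whether the cube still wraps a
# dask array, and core_data() returns that array without reading anything.
cube = iris.load_cube("some_file.nc")  # placeholder path
print(cube.has_lazy_data())  # True while the data is still lazy
lazy = cube.core_data()      # dask array; no data has been read yet
print(cube.has_lazy_data())  # still True: core_data() does not realize
_ = cube.data                # touching .data realizes the whole array
print(cube.has_lazy_data())  # now False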
Okay, that's not the problem either!
Just loading the data is enough for it to get killed!
import iris
# from esmvalcore.preprocessor._regrid import extract_levels
from esmvalcore.preprocessor._volume import extract_surface
fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
print('load cube:', fn)
cube = iris.load_cube(fn)
print(cube)
print(cube.data[:,0,:,:])
This also results in Killed, so it's not the fault of @bjlittle's stratify.
Just loading this data file breaks.
That's because you're trying to load all the data into memory; maybe it doesn't fit?
Try something like
import iris
fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
print('load cube:', fn)
cube = iris.load_cube(fn)
print(cube)
print(cube.core_data()[:,0,:,:])
This works, thanks! ... but this returns a dask array, which is not what I want. I just want to extract the surface layer of a cube, returning a cube (convert 4D -> 3D, or 3D -> 2D). extract_layer is unable to do that for these files either!
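For what it's worth, here is a minimal sketch of one way to get a Cube (not a bare dask array) back while staying lazy; it assumes the surface is index 0 along the depth dimension of a (time, depth, lat, lon) cube:
import iris

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)
# Indexing a Cube returns a Cube and keeps its data lazy; the assumption here
# is that the surface layer is index 0 of dimension 1 (depth).
surface = cube[:, 0, :, :]
print(type(surface))            # iris.cube.Cube, now 3D
print(surface.has_lazy_data())  # True: nothing has been realized yet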
Also, I should say that I've tried moving the preprocessor order around, and I had the same problem with regrid as well. I think that likely also realises the data, @bouweandela.
iris=3.6.1 is now available on conda-forge and it gets pulled into our environment, so if you can, try regenerating the env and using it to see if that fixes your issue @ledm :beer:
Just to confirm what I said by email, @valeriupredoi: updating to iris=3.6.1 does not solve this issue.
Method:
mamba install iris=3.6.1
in ESMValCore:
pip install --editable '.[develop]'
Then, in an interactive Python session:
>>> import iris
>>> iris.__version__
'3.6.1'
>>> import esmvalcore
>>> esmvalcore.__version__
'2.9.0.dev0+gb12682d2a.d20230627'
>>> import stratify
>>> stratify.__version__
'0.3.0'
Okay, so more investigation: watching top while running the script at the start of this issue shows a huge spike in memory usage. The file itself is only 2GB, but I've seen up to 8GB in top. Memory usage several times larger than the file suggests a memory issue in iris/stratify.
This is probably why re-ordering the preprocessors failed me earlier. I had assumed that if I extracted a smaller region first, then the surface layer, it would mean that less memory would be needed (this didn't work!). A memory leak means that it doesn't really matter how small a region you make it, as it will leak and break anyway.
@ledm here's what I found out: the script you gave me, i.e.
import iris
from esmvalcore.preprocessor._regrid import extract_levels
fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)
c2 = extract_levels(cube, scheme='nearest', levels=[0.1])
needs 13G of resident memory (RES) to run to completion; this is with:
esmvalcore 2.9.0rc1 pyh39db41b_0 conda-forge/label/esmvalcore_rc
esmvaltool 2.9.0.dev41+gda7f3dbe6 pypi_0 pypi
iris 3.6.1 pyha770c72_0 conda-forge
python-stratify 0.3.0 py311h1f0f07a_0 conda-forge
and the file in question is indeed 2GB, but remember that's a compressed netCDF4 file, usually with a ~40% compression factor. That means extract_levels loads the entire dataset into memory roughly three times over. @bouweandela says extract_levels is not lazy, so it's very clear how not-lazy it is; why the footprint is so bad, i.e. about 3x larger than the actual in-memory size of the data, is beyond me. Sorry I misinterpreted this, thinking the new iris would solve it; obviously not. But the question is: why is sci3 killing your job when it only needs 13G of memory? Unless that job was different, I see no reason why. Now, I believe stratify is lazy these days, so we can go ahead and make extract_levels lazy; in fact we should do that, but in the meantime, try running on a node that may not kick you out :grin:
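For context, a rough sketch of how to estimate the uncompressed in-memory footprint before realizing anything; nbytes on the lazy array is computed from the shape and dtype alone:
import iris

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)
lazy = cube.core_data()
# No data is read here: nbytes is just the product of shape and itemsize.
print(f"uncompressed: {lazy.nbytes / 1024**3:.1f} GiB, dtype {lazy.dtype}")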
The source of this problem is vinterp (old name) or stratify.interpolate() (new name) becoming completely realized/computed/not lazy due to levels and src_levels_broadcast being <class 'numpy.ndarray'>. This is exactly @bouweandela's issue https://github.com/ESMValGroup/ESMValCore/issues/2114. Just to confirm: indeed, the data in the example above is <class 'dask.array.core.Array'>, so making the coords lazy should be easy.
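To illustrate the point, here is a self-contained sketch with made-up shapes (not the actual ESMValCore code): if the arrays passed to stratify.interpolate are dask arrays, the interpolation should stay lazy, which is what the fix needs to arrange.
import dask.array as da
import numpy as np
import stratify

# Made-up (time, depth, lat, lon) field and matching source levels, kept as
# dask arrays so that nothing is realized here; the shapes are illustrative.
data = da.random.random((12, 75, 180, 360), chunks=(1, 75, 180, 360))
src_levels = da.broadcast_to(
    da.arange(75, dtype=float)[None, :, None, None], data.shape
).rechunk(data.chunks)
target_levels = np.array([0.1])

# With lazy inputs, stratify.interpolate (lazy since python-stratify 0.3.0)
# should return a lazy result, so only the thin target slice gets computed.
result = stratify.interpolate(
    target_levels, src_levels, data, axis=1, interpolation='nearest'
)
print(type(result), result.shape)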
try running on a node that may not kick you out
Lol, if only it were that easy. This gets killed for me on sci1, sci3, sci4, sci6, and the LOTUS high-mem queue!
sci2 did the trick for me. We now know where the problem lies, so a fix should follow 😁
Okay, running my original recipe (lol, not fried chicken!) on sci2 now. Don't know if this is useful information, but it's trying to download 20GB of data from ESGF now. Not sure why it never got that far before on sci1. (I don't think sci3 is connected to ESGF.)
Okay, so reverted to ESMValTool 2.8, and iris 3.4. I'm still running out of memory, but at least it's breaking properly, instead of just getting killed:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 8.76 GiB for an array with shape (1176120000,) and data type float64
Calling this a big W.
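As a sanity check on that number (just arithmetic, nothing specific to the recipe): 1,176,120,000 float64 values at 8 bytes each is indeed about 8.76 GiB.
# 1,176,120,000 float64 values at 8 bytes each, converted to GiB.
print(1_176_120_000 * 8 / 1024**3)  # ~8.76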
Correction: this was on sci2. On sci3, it just got killed the normal way. No idea what's going on. Starting to think it's a JASMIN thing. Will try sci6 next.
Continuing with this, here's a minimal testing recipe.
On JASMIN sci1 this fails for me. If I comment out either dataset, it runs fine.
The fact that it works with one dataset but fails with two makes me think that perhaps something isn't being properly closed after it's finished? Or it's trying to run two things at once, even though max_parallel_tasks: 1 is set in my config-user file.
The issue mentioned in the top post has been solved in https://github.com/ESMValGroup/ESMValCore/pull/2120 which will be available in the upcoming v2.11.0 release of ESMValCore.
I also investigated the recipe in https://github.com/ESMValGroup/ESMValTool/issues/3244#issuecomment-1623305615:
- There is a problem with the climate_statistics preprocessor function, caused by the 1D temporal weights consisting of a single Dask chunk; this results in too large chunks. This should be fixed by https://github.com/ESMValGroup/ESMValCore/pull/2404.
- It would be good if the esmvalcore.regrid preprocessor function automatically did that whenever possible. Opened https://github.com/ESMValGroup/ESMValCore/issues/2405 to discuss the possibilities.
Continuing with this, here's a minimal testing recipe.
@ledm The recipe now runs with the ESMValCore main branch (and the soon-to-be-released v2.11.0). Even though regridding is not lazy, this isn't such a problem, as the data has already been reduced in size a lot by computing the climate statistics and the vertical level extraction before regridding.
With https://github.com/ESMValGroup/ESMValCore/pull/2457 merged, regridding is now automatically lazy for data with 2D lat/lon coordinates as well.
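A quick way to check that kind of claim locally is sketched below, assuming a recent ESMValCore where both steps are lazy; the '1x1' target grid and the schemes are just example choices, not the recipe's settings.
import iris
from esmvalcore.preprocessor import extract_levels, regrid

fn = "/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r2i1p1f2/Omon/po4/gn/v20190708/po4_Omon_UKESM1-0-LL_historical_r2i1p1f2_gn_200001-201412.nc"
cube = iris.load_cube(fn)
cube = extract_levels(cube, levels=[0.1], scheme='nearest')
cube = regrid(cube, target_grid='1x1', scheme='linear')
# If every step kept the data lazy, nothing large has been read at this point.
print(cube.has_lazy_data())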
On JASMIN, jobs are being killed when the following code runs:
This occurs with several versions of esmvalcore (2.8.0, 2.8.1, 2.9.0).
The error occurs for all four schemes and a range of level values (0.0, 0.1, 0.5).
In all cases, the error occurs here: https://github.com/ESMValGroup/ESMValCore/blob/1101d36e3f343ec823842ea7c3f4b941ee942a89/esmvalcore/preprocessor/_regrid.py#L870
Stratify (version 0.3.0) is a C/Python interface wrapper, and it has previously caused trouble. It is not lazy, so it may try to load 120GB files into memory, among other issues. My previous solution to this problem was to write my own preprocessor:
https://github.com/ESMValGroup/ESMValCore/issues/1039 https://github.com/ESMValGroup/ESMValCore/pull/1048
That work has been abandoned, but I'm tempted to bring it back. (The deadline for this piece of work is 24th July!)
This is an extension of the discussion here: https://github.com/ESMValGroup/ESMValTool/issues/3239