Could be, but why is it creeping up here? I need to investigate moar... :beer:
The problem occurs if one of the input files is opened by _recipe.py (e.g. to extract the vertical levels to interpolate to) or by _data_finder.py (to find the start and end year, in the case of CMIP3 data), and is then later opened again in another process by the preprocessor.
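A minimal sketch of that pattern outside ESMValCore (file name and coordinate name are made up): the file is opened in the parent process and then read again from a forked worker, which can deadlock because HDF5/netCDF file handles are not fork-safe.

```python
import multiprocessing

import iris


def realise(cube):
    # Forces the lazy data to be read inside the worker process.
    return cube.data.shape


if __name__ == "__main__":
    cube = iris.load_cube("thetao_Omon_example.nc")  # parent opens the file
    print(cube.coord("depth").points[:3])  # parent reads from it, like _recipe.py does
    with multiprocessing.Pool(1) as pool:  # default 'fork' start method on Linux
        print(pool.apply(realise, (cube,)))  # may hang here, as reported in this issue
```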
Maybe you could attach the recipe so other people can try to reproduce the issue? I cannot find recipe_diagnostic_transect.yml in the ESMValTool.
Yeah, sorry @bouweandela, I've been busy looking at other stuff in the meantime. Here it is; I'll start poking around now too, on Jasmin :beer: recipe_sections.txt OM_diagnostic_transects.py.txt Also pinging @omeuriot since she's been running it :beer:
I know what the problem is: the ESMF regridder is trying to construct the regridder but it's not getting enough memory allocated, and it stays there in limbo (probably waiting for memory to be freed on the node). If you use extract_levels instead, no ESMF mumbo jumbo, you get the memory allocation error right away:
File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/__init__.py", line 224, in _run_preproc_function
return function(items, **kwargs)
File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/_regrid.py", line 532, in extract_levels
cube, src_levels, levels, scheme, extrap_scheme)
File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/_regrid.py", line 436, in _vertical_interpolate
if np.ma.is_masked(cube.data):
File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/cube.py", line 1726, in data
return self._data_manager.data
File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_data_manager.py", line 227, in data
raise MemoryError(emsg.format(self.shape, self.dtype))
MemoryError: Failed to realise the lazy data as there was not enough memory available.
The data shape would have been (132, 75, 330, 360) with dtype('float32').
Consider freeing up variables or indexing the data before trying again.
And I don't blame it - this would mean 37G of mem, absolutely massive. But it really should not realize the whole of the data in this manner; after all, the reason I'm asking for level selection here is to shrink the data.
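For the record, a quick back-of-the-envelope check of what a single copy of that array costs (the larger figures quoted in this thread presumably include the extra copies made during masking and interpolation):

```python
import numpy as np

# Single-copy footprint of the array from the MemoryError above.
shape = (132, 75, 330, 360)
nbytes = np.prod(shape) * np.dtype("float32").itemsize
print(f"{nbytes / 2**30:.1f} GiB")  # ~4.4 GiB per copy; the mask plus the
# source and target cubes of a non-lazy interpolation multiply this several times
```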
Neither regridding nor vertical interpolation is lazy at the moment: https://github.com/ESMValGroup/ESMValCore/issues/674
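The call that actually triggers the realisation in the traceback above is np.ma.is_masked(cube.data). A rough sketch of what a lazier check could look like, just to illustrate the direction #674 points at (not current ESMValCore code; the input file is made up):

```python
import dask.array as da
import iris

cube = iris.load_cube("thetao_Omon_example.nc")  # hypothetical input file

# cube.data would realise the full array; cube.core_data() returns the
# underlying dask array without loading it into memory.
lazy = cube.core_data()

# Reduce the mask chunk by chunk instead of materialising the whole cube,
# which is what np.ma.is_masked(cube.data) in the traceback ends up doing.
has_mask = bool(da.ma.getmaskarray(lazy).any().compute())
print(has_mask)
```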
Ya, so we're a bit in the bogs, since this will happen all the time when running with these sorts of variables on nodes with poor memory. BTW I just ran the ESMF regridding no problemo (the one both myself and @omeuriot were unable to run) after I reduced the time span to three years and selected two levels, so it's not an inherent issue with its functionality - it's just really bad at telling you that you need more memory :grin:
Here - at the point of realizing the lazy data:
File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_data_manager.py", line 216, in data
result = as_concrete_data(self._lazy_array)
File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_lazy_data.py", line 267, in as_concrete_data
data, = _co_realise_lazy_arrays([data])
File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_lazy_data.py", line 230, in _co_realise_lazy_arrays
computed_arrays = da.compute(*arrays)
File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/dask/base.py", line 444, in compute
results = schedule(dsk, keys, **kwargs)
It's trying to move about 25G of lazy data to real, and my node has only 14G of available memory, so the unloader should kick me out right away. The only way we can run such variables on Jasmin is on sci3, which is dreadfully slow and hammered by everybody and their dog. If we don't select levels, or regions, in advance, these sorts of recipes are impossible to run - @ledm do you think we can shrink the data somehow? Or not regrid it?
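One option, sketched with made-up levels and region bounds (parameter names from memory, please check them against the ESMValCore docs): chain extract_levels and extract_region ahead of regrid, so the cube is already small by the time the non-lazy regridder sees it.

```python
import iris
from esmvalcore.preprocessor import extract_levels, extract_region, regrid

cube = iris.load_cube("thetao_Omon_example.nc")  # hypothetical input file

# Throw away most of the vertical axis first (two target depths, in metres).
cube = extract_levels(cube, levels=[0.0, 100.0], scheme="linear")

# Then cut out the transect region of interest (made-up bounds).
cube = extract_region(cube, start_longitude=-30.0, end_longitude=30.0,
                      start_latitude=-60.0, end_latitude=60.0)

# Only now hand the much smaller cube to the (non-lazy) regridder.
cube = regrid(cube, target_grid="2x2", scheme="linear")
```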
OK, I managed to run the 2x2 ESMF regridding without level selection (all 75 levels in). It took 14 min to run a single year's worth of monthly data (12 time points, 330x360 grid), and 99.9% of that was waiting on available memory (ran it on sci5, where you usually get about 14G of available mem) -> this is pretty lame :grin:
I had a much closer look at this and found out where the actual delay is coming from: the esmpy regridder assembles the regrid instances per level into a list, and if you have a ton of levels that step takes forever. I parallelized building that list in _regrid_esmpy.py in #773, and for my example the regrid time decreased from 400s to 1s - a pretty hefty speedup! Can @zklaus and @omeuriot have a test please, and Klaus have a look at the implementation please :beer:
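The gist of the change, as a rough sketch rather than the actual diff in the PR (build_regridder here is a made-up stand-in for the per-level ESMF regridder construction in _regrid_esmpy.py): build the per-level instances concurrently instead of one after another, and collect them back into the same list.

```python
from concurrent.futures import ThreadPoolExecutor


def build_regridder(level):
    # Placeholder for the expensive per-level ESMF regridder construction.
    ...


levels = range(75)

# Serial version: this is where the minutes-long wait came from.
# regridders = [build_regridder(level) for level in levels]

# Parallel version: construct the per-level instances concurrently while
# keeping the original ordering of the list.
with ThreadPoolExecutor() as executor:
    regridders = list(executor.map(build_regridder, levels))
```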
moved to #775
If you don't KeyboardInterrupt it, it'll just hang forever - I recall @schlunma, @mattiarighi and @jvegasbsc have had similar issues? This one's pretty bad since this is a standard ocean recipe
recipe_diagnostic_transect.yml