Regrid hangs for ocean variable sea water salinity for standard 2x2 grid

valeriupredoi commented 4 years ago

2020-07-21 16:22:42,761 UTC [7554] DEBUG   Running regrid(sea_water_salinity                         (time: 252; depth: 75; cell index along second dimension: 330; cell index along first dimension: 360)
     Dimension coordinates:
          time                                  x           -                                      -                                      -
          depth                                 -           x                                      -                                      -
          cell index along second dimension     -           -                                      x                                      -
          cell index along first dimension      -           -                                      -                                      x
     Auxiliary coordinates:
          latitude                              -           -                                      x                                      x
          longitude                             -           -                                      x                                      x
     Attributes:
          Conventions: CF-1.7 CMIP-6.2
          activity_id: CMIP
          branch_method: standard
          branch_time_in_child: 0.0
          branch_time_in_parent: 144000.0
          cmor_version: 3.4.0
          comment: mo: This variable is reported using a z* coordinate system rather than...
          cv_version: 6.2.20.1
          data_specs_version: 01.00.29
          experiment: all-forcing simulation of the recent past
          experiment_id: historical
          external_variables: areacello volcello
          forcing_index: 2
          frequency: mon
          further_info_url: https://furtherinfo.es-doc.org/CMIP6.MOHC.UKESM1-0-LL.historical.none....
          grid: Native eORCA1 tripolar primarily 1 deg with meridional refinement down...
          grid_label: gn
          initialization_index: 1
          institution: Met Office Hadley Centre, Fitzroy Road, Exeter, Devon, EX1 3PB, UK
          institution_id: MOHC
          license: CMIP6 model data produced by the Met Office Hadley Centre is licensed under...
          mip_era: CMIP6
          mo_runid: u-bc179
          nominal_resolution: 100 km
          original_name: mo: (variable_name: so)
          original_units: 1e-3
          parent_activity_id: CMIP
          parent_experiment_id: piControl
          parent_mip_era: CMIP6
          parent_source_id: UKESM1-0-LL
          parent_time_units: days since 1850-1-1 00:00:00
          parent_variant_label: r1i1p1f2
          physics_index: 1
          product: model-output
          realization_index: 1
          realm: ocean
          source: UKESM1.0-LL (2018): 
aerosol: UKCA-GLOMAP-mode
atmos: MetUM-HadGEM3-GA7.1...
          source_id: UKESM1-0-LL
          source_type: AOGCM AER BGC CHEM
          sub_experiment: none
          sub_experiment_id: none
          table_id: Omon
          table_info: Creation Date:(13 December 2018) MD5:2b12b5db6db112aa8b8b0d6c1645b121
          title: UKESM1-0-LL output prepared for CMIP6
          variable_id: so
          variant_label: r1i1p1f2
     Cell methods:
          mean where sea: area
          mean: time, {'target_grid': '2x2', 'scheme': 'linear'})
<Ctrl-C>
Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/queues.py", line 352, in get
    res = self._reader.recv_bytes()
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/_task.py", line 730, in _run_task
    output_files = task.run()
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/_task.py", line 242, in run
    self.output_files = self._run(input_files)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/__init__.py", line 428, in _run
    product.apply(step, self.debug)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/__init__.py", line 296, in apply
    self.cubes = preprocess(self.cubes, step, **self.settings[step])
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/__init__.py", line 241, in preprocess
    result.append(_run_preproc_function(function, item, settings))
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/__init__.py", line 224, in _run_preproc_function
    return function(items, **kwargs)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/_regrid.py", line 335, in regrid
    cube = esmpy_regrid(cube, target_grid, scheme)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/_regrid_esmpy.py", line 332, in regrid
    res = map_slices(src, regridder, src_rep, dst_rep)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/_mapping.py", line 237, in map_slices
    res = func(src[src_ind])
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/_regrid_esmpy.py", line 210, in regridder
    res[i, ...] = esmf_regridder(src[i])
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/_regrid_esmpy.py", line 182, in regridder
    data = src.data
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/cube.py", line 1726, in data
    return self._data_manager.data
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_data_manager.py", line 216, in data
    result = as_concrete_data(self._lazy_array)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_lazy_data.py", line 267, in as_concrete_data
    data, = _co_realise_lazy_arrays([data])
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_lazy_data.py", line 230, in _co_realise_lazy_arrays
    computed_arrays = da.compute(*arrays)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/dask/base.py", line 444, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/dask/threaded.py", line 84, in get
    **kwargs
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/dask/local.py", line 475, in get_async
    key, res_info, failed = queue_get(queue)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/dask/local.py", line 133, in queue_get
    return q.get()
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/queue.py", line 170, in get
    self.not_empty.wait()
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt
2020-07-21 16:17:34,519 UTC [4082] INFO    Maximum memory used (estimate): 1.5 GB
2020-07-21 16:17:34,520 UTC [4082] INFO    Sampled every second. It may be inaccurate if short but high spikes in memory consumption occur.
Traceback (most recent call last):
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/bin/esmvaltool", line 11, in <module>
    load_entry_point('ESMValCore', 'console_scripts', 'esmvaltool')()
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/_main.py", line 430, in run
    fire.Fire(ESMValTool())
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/fire/core.py", line 468, in _Fire
    target=component.__name__)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/_main.py", line 407, in run
    process_recipe(recipe_file=recipe, config_user=cfg)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/_main.py", line 98, in process_recipe
    recipe.run()
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/_recipe.py", line 1359, in run
    max_parallel_tasks=self._cfg['max_parallel_tasks'])
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/_task.py", line 650, in run_tasks
    _run_tasks_parallel(tasks, max_parallel_tasks)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/_task.py", line 700, in _run_tasks_parallel
    time.sleep(0.1)
KeyboardInterrupt

if you dont KeyboardInterrupt it'll just hang forever - I recall @schlunma @mattiarighi and @jvegasbsc have had similar issues? This one's pretty bad since this is a standard ocean recipe recipe_diagnostic_transect.yml

bouweandela commented 4 years ago

Could it be this? https://github.com/ESMValGroup/ESMValCore/issues/430#issuecomment-575723809

valeriupredoi commented 4 years ago

could be but why is it creeping up here? I need to investigate moar... :beer:

bouweandela commented 4 years ago

The problem occurs if one of the input files is opened by _recipe.py, e.g. to extract the vertical levels to interpolate to, or by _data_finder.py to find the start and end year (in the case of CMIP3 data) and then later in another process by the preprocessor.

bouweandela commented 4 years ago

Maybe you could attach the recipe so other people can try to reproduce the issue? I cannot find recipe_diagnostic_transect.yml in the ESMValTool.

valeriupredoi commented 4 years ago

yeah sorry @bouweandela been busy looking at other stuff in the meantime, here it is, I'll start poking around now too, on Jasmin :beer: recipe_sections.txt OM_diagnostic_transects.py.txt also pinging @omeuriot since she's been running it :beer:

valeriupredoi commented 4 years ago

I know what's the problem - the ESMF regridder is trying to construct the regridder but it's not getting enough memory allocated, and it stays there in limbo (probably waiting for memory to be freed on the node), if you use select_levels instead, no ESMF mumbojumbo, you'll get the memory allocation right away:

  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/__init__.py", line 224, in _run_preproc_function
    return function(items, **kwargs)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/_regrid.py", line 532, in extract_levels
    cube, src_levels, levels, scheme, extrap_scheme)
  File "/home/users/valeriu/esmvalcore_users/esmvalcore/preprocessor/_regrid.py", line 436, in _vertical_interpolate
    if np.ma.is_masked(cube.data):
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/cube.py", line 1726, in data
    return self._data_manager.data
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_data_manager.py", line 227, in data
    raise MemoryError(emsg.format(self.shape, self.dtype))
MemoryError: Failed to realise the lazy data as there was not enough memory available.
The data shape would have been (132, 75, 330, 360) with dtype('float32').
 Consider freeing up variables or indexing the data before trying again.

valeriupredoi commented 4 years ago

and I don't blame it - this would mean 37G of mem - absolute mega, but it really should not realize the whole of the data in this manner, after all, the reason why I'm asking for level selection here is to shrink the data

bouweandela commented 4 years ago

neither regridding nor vertical interpolation are lazy at the moment: https://github.com/ESMValGroup/ESMValCore/issues/674

valeriupredoi commented 4 years ago

ya, so we're a bit in the bogs since this will happen all the time when running with these sort of variables and on nodes with poor memory. BTW I just ran the ESMF regridding no problemo (the one both myself and @omeuriot were unable to run) after I have reduced the time to three years and I selected two levels, so it's not an inherent issue with its functionality - it's just really bad at telling you need more memory :grin:

valeriupredoi commented 4 years ago

here - at the point of realizing the lazy data:

 File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_data_manager.py", line 216, in data
    result = as_concrete_data(self._lazy_array)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_lazy_data.py", line 267, in as_concrete_data
    data, = _co_realise_lazy_arrays([data])
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/iris/_lazy_data.py", line 230, in _co_realise_lazy_arrays
    computed_arrays = da.compute(*arrays)
  File "/home/users/valeriu/anaconda3R/envs/esmvaltool_users/lib/python3.7/site-packages/dask/base.py", line 444, in compute
    results = schedule(dsk, keys, **kwargs)

it's trying to move about 25G of lazy data to real, and my node has only 14G of available mem, the unloader should kick me out right away. The only way we can run such variables on Jasmin is on sci3 which is dreadfully slow and hammered by everybody and their dog. If we don't select levels in advance, or regions these sort of recipes are impossible to be run - @ledm do you think we can shrink the data somehow? Or not regrid it?

valeriupredoi commented 4 years ago

OK I managed to run the 2x2 ESMF regridding without level selection (all 75 levels in), it took 14min to run a single year worth of monthly data (12 time points, 330x360 grid) - 99.9% of it was waiting on available memory (ran it on sci5 where you usually get about 14G of avail mem) -> this is pretty lamers :grin:

valeriupredoi commented 4 years ago

I had a much closer look at this and found out what the actual delay is coming from - the esmpy regridder assembles the regrid instances per level into a list - if you have a ton of those levels that thing spends forever, I parallelized the list in _regrid_esmpy.py in #773 and for my example the regrid time decreased from 400s to 1s - pretty hefty speedup! Can @zklaus and @omeuriot have a test please and Klaus have a look at the implementation please :beer:

valeriupredoi commented 4 years ago

moved to #775

ESMValGroup / ESMValCore

Regrid hangs for ocean variable sea water salinity for standard 2x2 grid #724