vanroekel closed this issue 7 years ago.
@vanroekel, this means we need to set a smaller chunk size, e.g., make https://github.com/MPAS-Dev/MPAS-Analysis/blob/develop/config.default#L77 a smaller number.
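For reference, the option in question lives in config.default; a sketch of what the relevant entry looks like (the section name and value below are illustrative, not copied from the repository; check the linked file for the actual section and default):

```ini
[input]
# maximum chunk size (in time) used when reading input files;
# reduce this if the analysis runs out of memory
maxChunkSize = 10000
```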
Some background on this: I previously raised related issues with the xarray and dask developers (https://github.com/pydata/xarray/issues/1338 and https://github.com/dask/dask/issues/2138); if I understand correctly, the plan is for a more meaningful warning to be raised in the future.
@pwolfram is there any guidance on where this value should be set?
@vanroekel, please see http://xarray.pydata.org/en/stable/dask.html#chunking-and-performance, specifically:
A good rule of thumb is to create arrays with a minimum chunksize of at least one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), the cost of queueing up dask operations can be noticeable, and you may need even larger chunksizes.
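The rule of thumb above can be turned into a quick back-of-the-envelope calculation. A minimal sketch (the helper name `min_time_chunk` is mine, not from MPAS-Analysis or xarray) that picks how many time slices per chunk are needed to reach roughly one million elements for an (nTime, ny, nx) array:

```python
import math

def min_time_chunk(ny, nx, min_elements=1_000_000):
    """Smallest number of time slices per chunk so that each chunk
    of an (nTime, ny, nx) array holds at least min_elements values."""
    elements_per_slice = ny * nx
    return max(1, math.ceil(min_elements / elements_per_slice))

# a 1000x1000 slice already holds one million elements,
# so a single time slice per chunk is enough
print(min_time_chunk(1000, 1000))  # -> 1

# a 100x100 slice holds only 10,000 elements,
# so about 100 slices are needed per chunk
print(min_time_chunk(100, 100))    # -> 100
```

The tension in this thread is that chunks must be large enough to amortize dask's scheduling overhead but small enough to fit in memory, which is why a single default value is hard to pick.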
@pwolfram I was able to compute 10 years of MOC by experimenting with the chunk size, but it was frustrating to have to adjust this value several times before it worked. Is there a way to improve the documentation on how to set it (your explanation above suggests to me that maxChunkSize should be increased, not decreased)? Or, better yet, could it be set automatically? I can imagine someone on the coupled team getting as frustrated as I was trying to figure out this parameter. Given the error that is printed (dask.async.memory), I don't see how a user would know to go to config.default and change that parameter, much less how to change it correctly.
I just checked the output from the 10-year run I discussed above and I don't see any plots. Looking back at the log, the run did not finish; I see "Killed" after 90 minutes of computation. Do you think I should experiment with chunkSize more?
A separate question: has the MOC calculation been run on edison for Chris's 60to30 coupled case for multiple years? I'm wondering if this is an anvil-specific issue.
Let's discuss this tomorrow and make it a priority, since the MOC is in the v0.2 version that @xylar and I are working on pulling into ACMEPreAndProcessing.
Thanks for all your work, @vanroekel!
Yes, I did run the MOC on the beta1 runs, even for 30 years. This was before we had the maxchunk fix (but I was basically just lucky never to run into a memory problem on the login node...).
Confirming that this appears to be an anvil issue: I can run this on edison (10 years of beta1_2) with no problems, but still no luck on anvil. I have done many tests across a wide range of chunk sizes, with no success.
After talking with @vanroekel, it sounds like an anvil-specific issue. We are trying this with all conda-forge packages and will try again.
@pwolfram I've updated all packages to the latest and greatest and all use conda-forge, but still no dice.
I'll have to leave this for now. I don't have time to work on this anymore.
@vanroekel: things work for you now, with the caching in place, right? If so, I think we can close this.
This issue has been addressed by #177
I've been trying to process simulation output from a G-case (60to30) on anvil. If I do one year, the MOC calculates fine, but if I do ten years, I see the following:
I have checked my conda environment to verify it is up to date. Is there some config option I need to set? I saw a file percent option and a file chunk size option. Do I need to set those differently? I'm using the default values for both now.