coecms / xmhw

Xarray version of Marine Heatwaves code by Eric Olivier
https://xmhw.readthedocs.io/en/latest/
Apache License 2.0
21 stars 10 forks source link

management of Nans #37

Closed paolap closed 2 years ago

paolap commented 2 years ago

Currently we are managing nans by using np.nanpercentile and np.nanmean to calculate climatologies. This is inefficient as if there are no nans in the time series np.percentile np.mean would be much faster. So we decided to change the way we handle nans. The standard workflow will assume there are few random nans in the each grid point time series unless a grid point is land and then all points are nans.

land_check function will eliminate by default all grid cells that have all nans along time axis. This would be modified to accept the "any" option, so we can eliminate all grid point that have even 1 nan point. This choice will be left to the user.

Before running anything there will be an option to remove potential nans from the time series by interpolation. Still to be defined how to actually do this, in original marineHeatWave code there's an attempt to pad blocks of consecutive missing values:

Pad missing values for all consecutive missing blocks of length <= maxPadLength

if maxPadLength:
    temp = pad(temp, maxPadLength=maxPadLength)
    tempClim = pad(tempClim, maxPadLength=maxPadLength)

So in this way a user has the following options:

1) they are sure there are no nans use the default settings: no interpolating, and dropnan=all in land_check 2) the user suspect only few nans: interpolating, and dropnan=all in land_check 3) potentially many nans and/or user want to be more restrictive: no intepolating, dropnan=any in land_check

In any case then it should be safe to run np.percentile or any function based on it in preference to np.nanpercentile.

paolap commented 2 years ago

Changes to implement new strategy:

def threshold(temp, tdim='time', climatologyPeriod=[None,None], pctile=90, windowHalfWidth=5, smoothPercentile=True, smoothPercentileWidth=31, maxPadLength=None, coldSpells=False, Ly=False, anynans=False, skipna=False) Added a "anynas" option to land_check, and to threshold and detect. By default is False , if true dropnan how is 'any' rather than 'all'

Changed from "pad" function to xarray.interpolate_na function to interpolate missing values. This is still trigged by a maxPadLength different from None, which is still the default. It's up to the user to change this is they suspect nans values are present in their time series. In that case they can now either interpolate them at the start or use nanpercentile option (see below).

skipna - by default is False so percentile and mean functions are used to calculate threshold, if skipna is True then the equivalent of nanpercentile and nanmean is used instead. As the first option is faster is used as default alternative branch commit https://github.com/coecms/xmhw/commit/19c7e55fd9889f8b9f1e14950e4f9686a6b2a60b