Currently we are managing nans by using np.nanpercentile and np.nanmean to calculate climatologies. This is inefficient as if there are no nans in the time series np.percentile np.mean would be much faster. So we decided to change the way we handle nans. The standard workflow will assume there are few random nans in the each grid point time series unless a grid point is land and then all points are nans.

land_check function will eliminate by default all grid cells that have all nans along time axis. This would be modified to accept the "any" option, so we can eliminate all grid point that have even 1 nan point. This choice will be left to the user.

Before running anything there will be an option to remove potential nans from the time series by interpolation. Still to be defined how to actually do this, in original marineHeatWave code there's an attempt to pad blocks of consecutive missing values:

Pad missing values for all consecutive missing blocks of length <= maxPadLength

if maxPadLength:
    temp = pad(temp, maxPadLength=maxPadLength)
    tempClim = pad(tempClim, maxPadLength=maxPadLength)

So in this way a user has the following options:

1) they are sure there are no nans use the default settings: no interpolating, and dropnan=all in land_check 2) the user suspect only few nans: interpolating, and dropnan=all in land_check 3) potentially many nans and/or user want to be more restrictive: no intepolating, dropnan=any in land_check

In any case then it should be safe to run np.percentile or any function based on it in preference to np.nanpercentile.

Changes to implement new strategy:

def threshold(temp, tdim='time', climatologyPeriod=[None,None], pctile=90, windowHalfWidth=5, smoothPercentile=True, smoothPercentileWidth=31, maxPadLength=None, coldSpells=False, Ly=False, anynans=False, skipna=False) Added a "anynas" option to land_check, and to threshold and detect. By default is False , if true dropnan how is 'any' rather than 'all'

Changed from "pad" function to xarray.interpolate_na function to interpolate missing values. This is still trigged by a maxPadLength different from None, which is still the default. It's up to the user to change this is they suspect nans values are present in their time series. In that case they can now either interpolate them at the start or use nanpercentile option (see below).

skipna - by default is False so percentile and mean functions are used to calculate threshold, if skipna is True then the equivalent of nanpercentile and nanmean is used instead. As the first option is faster is used as default alternative branch commit https://github.com/coecms/xmhw/commit/19c7e55fd9889f8b9f1e14950e4f9686a6b2a60b

coecms / xmhw

management of Nans #37

Pad missing values for all consecutive missing blocks of length <= maxPadLength