Add spatial fill nearest neighbor function

delgadom commented 6 years ago

prototype implementation

import numpy as np
from scipy.spatial import cKDTree

def spatial_fillna_nearest_neighbor(
        da,
        x_dim='longitude',
        y_dim='latitude',
        distance_upper_bound=np.inf,
        inplace=False):

    xy_dims = [x_dim, y_dim]
    not_xy_dims = [d for d in da.dims if d not in xy_dims]

    not_all_nans = da.notnull().any(dim=not_xy_dims)

    # get vectors of isnull, notnull flags
    stacked_isnull_flag = (~not_all_nans).stack(obs=xy_dims)
    notnull_flag = (~stacked_isnull_flag.values)
    isnull_flag = stacked_isnull_flag.values

    # get full set of xy points
    xy_full = np.vstack([stacked_isnull_flag[x_dim], stacked_isnull_flag[y_dim]]).T

    # get set of isnull, notnull xy points
    xy_isnull = xy_full[isnull_flag]
    xy_notnull = xy_full[notnull_flag]

    # build kdtree from valid points
    tree = cKDTree(xy_notnull)
    _, null_nn_notnull_indices = tree.query(
        xy_isnull, k=1, distance_upper_bound=distance_upper_bound)

    nearest_neighbor_valid = (null_nn_notnull_indices != len(xy_notnull))

    # build a mask for null values that have been successfully mapped to nearest neighbors
    isnull_and_filled_flag = isnull_flag.copy()
    isnull_and_filled_flag[isnull_flag] = nearest_neighbor_valid

    # build an indexing array with filled values pointing to their nearest neighbors
    isnull_nn_indices = np.arange(xy_full.shape[0])
    isnull_nn_indices[isnull_and_filled_flag] = (
        isnull_nn_indices[notnull_flag][null_nn_notnull_indices[nearest_neighbor_valid]])

    if not inplace:
        da = da.copy()

    all_dims = (not_xy_dims + xy_dims)
    dim_inds = [da.dims.index(d) for d in all_dims]
    res_shapes = [da.shape[i] for i in dim_inds]
    dim_sorter = [all_dims.index(d) for d in da.dims]

    da.values = (
        da
        .stack(obs=xy_dims)
        .transpose(*tuple(list(not_xy_dims) + ['obs']))
        .values[..., isnull_nn_indices]
        .reshape(res_shapes)
        .transpose(*dim_sorter))

    if not inplace:
        return da

delgadom commented 6 years ago

@kemccusker here's my implementation of a spatial nearest neighbor function. it assumes that the spatial pattern of nearest neighbors shouldn't change with time (it interpolates cells that are always NaN to cells that have at least one non-NaN value).

thinking about putting this in climate toolbox. any requests?

dgergel commented 4 years ago

@delgadom I think it would be great if we could add this to the climate toolbox. One edit that I needed to make this function run:

notnull_flag = (~stacked_isnull_flag).values

brews commented 4 years ago

Cool!

If used ~globally, do we care about bias from using rectangular, latlon grids at higher |lat|? (And maybe I'm being too nerdy and this isn't really an issue for target use cases)

The goal is just to interpolate-out NaNs, right?

dgergel commented 4 years ago

I think for now we don't need to worry too much about bias at higher latitudes.

Yep - the goal is just to interpolate out NaNs, because the interpolate_na function in xarray only works on 1-d arrays and doesn't like a multiindex if you stack dims, so functionality outside of xarray is necessary. Scipy's cKDTree function has a tolerance option that its interpolate module doesn't have so that's the motivation for using it. I also considered switching cKDTree to BallTree but decided that was too nerdy (unless you have thoughts on this @brews?)

brews commented 4 years ago

Do we need it to run fast/small and be super awesome? I wouldn't be surprised if someone already has balltree with haversine distance or something.

Edit: Mature, Grown-up Brewster says: Seriously though, Mike's solution might be good enough. I wouldn't sweat it unless its been shown to be a bottleneck.

dgergel commented 4 years ago

that's actually why I considered switching to balltree because it has the haversine option.

But then got pressed for time so I stuck with this. this actually is pretty fast, I used it on global data and it was significantly faster than my 1-d solution that involved stacking and then interpolating.

delgadom commented 4 years ago

Yeah these are all great caveats for this function for sure. We currently just use it to map near-coastal pixels to coastal pixels for areas with a landmask mismatch (e.g. comparing NASA/NEX and BEST). I'm not super worried about the error introduced from slightly too frequently grabbing values from cells above/below rather than left/right, and the intention is to explicitly prevent interpolating over large distances with the distance_upper_bound argument, so I think we're good. Nearest neighbor is waaay faster than anything messing with haversine distance.

ClimateImpactLab / climate_toolbox

Add spatial fill nearest neighbor function #14