add nearest neighbor lookup with KDTree for point to point distances

cchwala commented 1 month ago

This PR adds an easy to use function for point data xarray.Datasets to calculate nearest neighbor distances using scipy.spatial.KDTree. The results are returned in a nicely structured xarray.Dataset.

The example notebook is updated accordingly, also removing some of the usage of the distance matrix because getting nearest neighbors is more convenient with the introduced functions.

[x] closes #37
[x] Tests added and passed if fixing a bug or adding a new feature
[x] All code checks from pre-commit passed
[x] Added an entry in the CHANGELOG.md file

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (61c4063) to head (47155f0). Report is 12 commits behind head on main.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #49 +/- ## ========================================== Coverage 100.00% 100.00% ========================================== Files 3 4 +1 Lines 43 224 +181 ========================================== + Hits 43 224 +181 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

cchwala commented 1 month ago

IMO the solution I currently have for returning the distance and neighbor IDs is good. It is borrowed and modified from #43.

I am not sure if is good to have np.nan as entry in the array of neighbor_id because this results in a dtype='object' because strings and floats are mixed. The advantage of having np.nan there is that one can easily drop these entries in xarray via xarray.DataArray..dropna() e.g. as I do in the example notebook to get the neighbor IDs for a certain PWS which I can then use to select data or locations from all neighbors:

neighbor_ids = closest_neigbors.sel(id=pws_id).neighbor_id.dropna(dim="n_closest")
ds_pws_neighbors = ds_pws.sel(id=neighbor_ids)

The .sel() of course only works if there is no non-matching ID in neighbor_ids.

cchwala commented 1 month ago

@eoydvin It would be good if you could have a look at my code and notebook, to check it, but also to see how it fits with your PR #43.

eoydvin commented 1 month ago

IMO the solution I currently have for returning the distance and neighbor IDs is good. It is borrowed and modified from #43.

I am not sure if is good to have np.nan as entry in the array of neighbor_id because this results in a dtype='object' because strings and floats are mixed. The advantage of having np.nan there is that one can easily drop these entries in xarray via xarray.DataArray..dropna() e.g. as I do in the example notebook to get the neighbor IDs for a certain PWS which I can then use to select data or locations from all neighbors:
neighbor_ids = closest_neigbors.sel(id=pws_id).neighbor_id.dropna(dim="n_closest")
ds_pws_neighbors = ds_pws.sel(id=neighbor_ids)
The .sel() of course only works if there is no non-matching ID in neighbor_ids.

IMO the solution I currently have for returning the distance and neighbor IDs is good. It is borrowed and modified from #43.

I am not sure if is good to have np.nan as entry in the array of neighbor_id because this results in a dtype='object' because strings and floats are mixed. The advantage of having np.nan there is that one can easily drop these entries in xarray via xarray.DataArray..dropna() e.g. as I do in the example notebook to get the neighbor IDs for a certain PWS which I can then use to select data or locations from all neighbors:
neighbor_ids = closest_neigbors.sel(id=pws_id).neighbor_id.dropna(dim="n_closest")
ds_pws_neighbors = ds_pws.sel(id=neighbor_ids)
The .sel() of course only works if there is no non-matching ID in neighbor_ids.

Hm, good point. In the newest version of #43 I use None instead of np.nan when I first create the array using np.full([ds_cmls.cml_id.size, n_closest], None). Im not sure what is the best way to do it, but maybe it is good if the two functions do it the same way?

cchwala commented 1 month ago

Im not sure what is the best way to do it, but maybe it is good if the two functions do it the same way?

What is returned by your function in #43 and by my function here, should have the exact same structure and conventions.

I suggest that we discuss None vs np.nan vs something_else via chat or video call during the next days. Unless there are already now good arguments for one of the options.

cchwala commented 1 month ago

Conclusion from discussion with @eoydvin

Returning and xr.Dataset with the current structure is good. He will do that in #43, too.
It is better to have None instead of np.nan as entry in neighbor_id for the cases where no neighbor is found within max_distance. This is preferred because it makes more sense from a semantic point of view, to say that there is none neighbor ID. Using .dropna() e.g. closest_neigbors.neighbor_id.sel(id='ams12').dropna(dim='n_closest') still works with the entries being None.
We do not want to have all coords from ds_points in the returned dataset, only the id coordinate.

cchwala commented 1 month ago

@eoydvin I implemented the points 2. and 3. from above. I also cleaned up the code and the notebook, which I also extended a bit. Please have a look again at this PR.

OpenSenseAction / poligrain

add nearest neighbor lookup with KDTree for point to point distances #49

Codecov Report