OpenSenseAction / poligrain

Simplify common tasks for working with point, line and gridded sensor data, focusing on rainfall observations.
https://poligrain.readthedocs.io
BSD 3-Clause "New" or "Revised" License
2 stars 10 forks source link

Set default value for `n_closest` for finding nearest neighbors with `get_closest_points_to_point` #51

Open JochenSeidel opened 1 month ago

JochenSeidel commented 1 month ago

_nclosest is a required input for the _plg.spatial.get_closest_points_topoint function. I would suggest to make it optional since the number of neighbouring stations might be irrelevant for some calculations, e.g. the indicator correlation. Optionally, the number could be set to a very high number...

PS: I wanted to label this as "discussion" but I don't know how that works....

cchwala commented 1 month ago

The default for n_closest in the underlying function scipy.spatial.KDTree.query is k=1, where k is the number of neighbors.

If we make n_closest optional, I suggest the default would be to return only the first neighbor. We can of course select another default. But it would IMO be strange to set n_closest to 100. Because, why not 1000? Also, the point of doing the nearest neighbor lookup is to have some limits, because the size of the returned matrix is N_stations x n_closest.

We cannot omit n_closest, because scipy.spatial.KDTree,query always gets a k, which is k=1 as default, as written above.

So, the only two options I see are:

  1. Require n_closest as function argument, which is the current behavior.
  2. Set default n_closest=1

Since, I expect that most people do not want n_closest=1 very often, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.

cchwala commented 1 month ago

Maybe @lepetersson could have a look at this discussion before the meeting today.

JochenSeidel commented 1 month ago

The default for n_closest in the underlying function scipy.spatial.KDTree.query is k=1, where k is the number of neighbors.

If we make n_closest optional, I suggest the default would be to return only the first neighbor. We can of course select another default. But it would IMO be strange to set n_closest to 100. Because, why not 1000? Also, the point of doing the nearest neighbor lookup is to have some limits, because the size of the returned matrix is N_stations x n_closest.

We cannot omit n_closest, because scipy.spatial.KDTree,query always gets a k, which is k=1 as default, as written above.

So, the only two options I see are:

1. Require `n_closest` as function argument, which is the current behavior.

2. Set default `n_closest=1`

Since, I expect that most people do not want n_closest=1 very often, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.

Would it make sense to setn_closest based on the number of stations within the max_distance as default?

lepetersson commented 1 month ago

The default for n_closest in the underlying function scipy.spatial.KDTree.query is k=1, where k is the number of neighbors. If we make n_closest optional, I suggest the default would be to return only the first neighbor. We can of course select another default. But it would IMO be strange to set n_closest to 100. Because, why not 1000? Also, the point of doing the nearest neighbor lookup is to have some limits, because the size of the returned matrix is N_stations x n_closest. We cannot omit n_closest, because scipy.spatial.KDTree,query always gets a k, which is k=1 as default, as written above. So, the only two options I see are:

1. Require `n_closest` as function argument, which is the current behavior.

2. Set default `n_closest=1`

Since, I expect that most people do not want n_closest=1 very often, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.

Would it make sense to setn_closest based on the number of stations within the max_distance as default?

I think it would be more intutitive to not have to set n_closest, but rather (for example) loop through the stations and the distance matrix and create a xr dataset where the neighbour list per station is stored. In this way the user don't need to care about selecting n_closest. Perhaps you can look for the highest number of neighbours first and then set n_closest to that. Let's discuss later. Referring to this documentation

cchwala commented 1 month ago

Conclusion based on discussion today:

  1. as default n_closest should be the number of potential neighboring stations
  2. max_distance should be a required parameter
  3. add to documentation info that n_closest has to be set if input datasets are really large, because resulting data will be even larger
eoydvin commented 1 month ago

Adding some thoughts to the discussion: One workaround using current code could be to select the station you want to find all neighbors to like this:

closest_neigbors = plg.spatial.get_closest_points_to_point(
    ds_points=ds_pws,
    ds_points_neighbors=ds_pws.sel(id = pws_id),
    max_distance=max_distance,
    n_closest=1, # not relevant, we select the first station regardless
).isel(n_closest = 0)

Then all closest neighbors to station "pws_id" can be found like this:

neighbor_ids = closest_neigbors.where(closest_neigbors.neighbor_id != None,  drop = True).id
ds_pws_neighbors = ds_pws.sel(id=neighbor_ids)
plt.scatter(ds_pws_neighbors.x, ds_pws_neighbors.y, c="C0", s=30)

Its not the most intuitive syntax and for it to work you would have to loop through all the PWSs, at the cost of computational time. Another potential option could be to use the kd_tree.query_ball_point like I do in #43, possibly by calculating all distances at once or by looking through all stations... If the behavior of get_closest_points_to_point is changed we should consider adapting the output in #43 to be similar?