Open JochenSeidel opened 1 month ago
The default for n_closest
in the underlying function scipy.spatial.KDTree.query
is k=1
, where k
is the number of neighbors.
If we make n_closest
optional, I suggest the default would be to return only the first neighbor. We can of course select another default. But it would IMO be strange to set n_closest
to 100. Because, why not 1000? Also, the point of doing the nearest neighbor lookup is to have some limits, because the size of the returned matrix is N_stations
x n_closest
.
We cannot omit n_closest
, because scipy.spatial.KDTree,query
always gets a k
, which is k=1
as default, as written above.
So, the only two options I see are:
n_closest
as function argument, which is the current behavior.n_closest=1
Since, I expect that most people do not want n_closest=1
very often, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.
Maybe @lepetersson could have a look at this discussion before the meeting today.
The default for
n_closest
in the underlying functionscipy.spatial.KDTree.query
isk=1
, wherek
is the number of neighbors.If we make
n_closest
optional, I suggest the default would be to return only the first neighbor. We can of course select another default. But it would IMO be strange to setn_closest
to 100. Because, why not 1000? Also, the point of doing the nearest neighbor lookup is to have some limits, because the size of the returned matrix isN_stations
xn_closest
.We cannot omit
n_closest
, becausescipy.spatial.KDTree,query
always gets ak
, which isk=1
as default, as written above.So, the only two options I see are:
1. Require `n_closest` as function argument, which is the current behavior. 2. Set default `n_closest=1`
Since, I expect that most people do not want
n_closest=1
very often, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.
Would it make sense to setn_closest
based on the number of stations within the max_distance
as default?
The default for
n_closest
in the underlying functionscipy.spatial.KDTree.query
isk=1
, wherek
is the number of neighbors. If we maken_closest
optional, I suggest the default would be to return only the first neighbor. We can of course select another default. But it would IMO be strange to setn_closest
to 100. Because, why not 1000? Also, the point of doing the nearest neighbor lookup is to have some limits, because the size of the returned matrix isN_stations
xn_closest
. We cannot omitn_closest
, becausescipy.spatial.KDTree,query
always gets ak
, which isk=1
as default, as written above. So, the only two options I see are:1. Require `n_closest` as function argument, which is the current behavior. 2. Set default `n_closest=1`
Since, I expect that most people do not want
n_closest=1
very often, e.g. when working with "buddy checks" of PWS, I prefer option 1. But this can be discussed, e.g. during the meeting today.Would it make sense to set
n_closest
based on the number of stations within themax_distance
as default?
I think it would be more intutitive to not have to set n_closest, but rather (for example) loop through the stations and the distance matrix and create a xr dataset where the neighbour list per station is stored. In this way the user don't need to care about selecting n_closest. Perhaps you can look for the highest number of neighbours first and then set n_closest to that. Let's discuss later. Referring to this documentation
Conclusion based on discussion today:
n_closest
should be the number of potential neighboring stationsmax_distance
should be a required parametern_closest
has to be set if input datasets are really large, because resulting data will be even largerAdding some thoughts to the discussion: One workaround using current code could be to select the station you want to find all neighbors to like this:
closest_neigbors = plg.spatial.get_closest_points_to_point(
ds_points=ds_pws,
ds_points_neighbors=ds_pws.sel(id = pws_id),
max_distance=max_distance,
n_closest=1, # not relevant, we select the first station regardless
).isel(n_closest = 0)
Then all closest neighbors to station "pws_id" can be found like this:
neighbor_ids = closest_neigbors.where(closest_neigbors.neighbor_id != None, drop = True).id
ds_pws_neighbors = ds_pws.sel(id=neighbor_ids)
plt.scatter(ds_pws_neighbors.x, ds_pws_neighbors.y, c="C0", s=30)
Its not the most intuitive syntax and for it to work you would have to loop through all the PWSs, at the cost of computational time. Another potential option could be to use the kd_tree.query_ball_point like I do in #43, possibly by calculating all distances at once or by looking through all stations... If the behavior of get_closest_points_to_point is changed we should consider adapting the output in #43 to be similar?
_nclosest is a required input for the _plg.spatial.get_closest_points_topoint function. I would suggest to make it optional since the number of neighbouring stations might be irrelevant for some calculations, e.g. the indicator correlation. Optionally, the number could be set to a very high number...
PS: I wanted to label this as "discussion" but I don't know how that works....