Fast nearest neigbor lookup with KDTree

cchwala commented 3 months ago

For very large number of sensors, above 10,000, the distance matrix calculation gets pretty slow (something like 30 sec for 10,000 sensors against themselfs, if I recall correctly) and this scales quadratically (I guess...). Hence we might also want to provide the option to calculat nearest neighbors with scipy.spatial.KDTree, since we typically only use sensors within a certain distance (max. some tens of kilometers) which means that most parts of the distance matrix is not used if our domain is large e.g. for country-wide data in Germany.

Originally posted by @cchwala in https://github.com/OpenSenseAction/poligrain/issues/15#issuecomment-1924644621

cchwala commented 3 months ago

Note that there is also the option in scipy to calculate a sparse distance matrix using KDTree.

Maybe we could make it an option to use this sparse version in the current function for the calculation of the distance matrix.

We might hover, also switch to the simple nearest-neigbor lookup via KDTree.query because this is most often the use case, e.g. for QC with neighboring stations as we do it in pypwsqc.

cchwala commented 1 month ago

It should be fairly easy to use the sparse distance calculations. We just have to exchange the code in https://github.com/OpenSenseAction/poligrain/blob/5832645bd88122ad812cecc6df27eab99bc71204/src/poligrain/spatial.py#L113 with the sparse version.

We would still need to decide if we want a function that simply returns a list of nearest neighbors for a specific point using KDtree and query. Maybe it is enough to have the sparse distance matrix with nice axis labels based on IDs of the stations that have been used.

cchwala commented 1 month ago

Some notes based on preliminary tests:

The "empty" portions of a sparse matrix are treated as zeros when converting to a dense matrix. Hence, there is no way to distinguish between zeros from distance calculation between equal coordinates and distance calculations that have not been carried out because distance > max_distance. If one is aware of that peculiarity, that will not cause problems, but if somebody would not exclude all IDs with zero distance (which also contains IDs from far-away sensors) this might results in undesired behavior of code that works okay with a dense distance matrix calculations.
While a distance matrix is nice to get an overview of all distance with a clear relationship between all locations in an easy 2D matrix layout, there is no direct need to use a distance matrix in any of our functions. There is currently a distance matrix being used in the indicator correlation code (see code snippet here), but we might also use a KDTree nearest neighbor lookup here since only the neighbors within a certain distance are considered.
Apparently, as written in https://github.com/OpenSenseAction/poligrain/issues/15#issuecomment-2009301516, we already decided during the meeting in Prague that the KDTree neighbor lookup is more important than the sparse distance matrix calculation.

Conclusion: Priorities KDTree neighbor lookup for points. Maybe add sparse distance matrix calculation, too.

UPDATE one day later:

There seems to be an option to set the fill_value of a COO sparse matrix in the sparse package, see the docs. If this works, we do not have the problem that all empty values are treated as zeros in a sparse matrix, which could lead to strange behavior of code in our case.

Still we will prioritize the KDTree neighbor lookup for points.

cchwala commented 1 month ago

The way to return the data from the KDTree.query() could be done like in #43, specifically like the code here https://github.com/OpenSenseAction/poligrain/pull/43#discussion_r1591489036

OpenSenseAction / poligrain

Fast nearest neigbor lookup with KDTree #37