Closed cchwala closed 1 month ago
Note that there is also the option in scipy
to calculate a sparse distance matrix using KDTree.
Maybe we could make it an option to use this sparse version in the current function for the calculation of the distance matrix.
We might hover, also switch to the simple nearest-neigbor lookup via KDTree.query
because this is most often the use case, e.g. for QC with neighboring stations as we do it in pypwsqc
.
It should be fairly easy to use the sparse distance calculations. We just have to exchange the code in https://github.com/OpenSenseAction/poligrain/blob/5832645bd88122ad812cecc6df27eab99bc71204/src/poligrain/spatial.py#L113 with the sparse version.
We would still need to decide if we want a function that simply returns a list of nearest neighbors for a specific point using KDtree and query. Maybe it is enough to have the sparse distance matrix with nice axis labels based on IDs of the stations that have been used.
Some notes based on preliminary tests:
distance > max_distance
. If one is aware of that peculiarity, that will not cause problems, but if somebody would not exclude all IDs with zero distance (which also contains IDs from far-away sensors) this might results in undesired behavior of code that works okay with a dense distance matrix calculations.Conclusion: Priorities KDTree neighbor lookup for points. Maybe add sparse distance matrix calculation, too.
UPDATE one day later:
There seems to be an option to set the fill_value
of a COO sparse matrix in the sparse
package, see the docs. If this works, we do not have the problem that all empty values are treated as zeros in a sparse matrix, which could lead to strange behavior of code in our case.
Still we will prioritize the KDTree neighbor lookup for points.
The way to return the data from the KDTree.query()
could be done like in #43, specifically like the code here https://github.com/OpenSenseAction/poligrain/pull/43#discussion_r1591489036
For very large number of sensors, above 10,000, the distance matrix calculation gets pretty slow (something like 30 sec for 10,000 sensors against themselfs, if I recall correctly) and this scales quadratically (I guess...). Hence we might also want to provide the option to calculat nearest neighbors with
scipy.spatial.KDTree
, since we typically only use sensors within a certain distance (max. some tens of kilometers) which means that most parts of the distance matrix is not used if our domain is large e.g. for country-wide data in Germany.Originally posted by @cchwala in https://github.com/OpenSenseAction/poligrain/issues/15#issuecomment-1924644621