Add method for finding neighoring points and lines

cchwala commented 5 months ago

UPDATE of status of implementation:

[x] line-to-point distance
[ ] lint-to-line distance

Is your feature request related to a problem? Please describe. We need a unified way of finding neighboring sensors, both from points and lines, and a mixture of both.

Describe the solution you'd like There is an implementation for line-to-point neighbor lookup using shapely here in the prototype that I built some time ago. For line-to-line this is IMO the most solid implementation. It can also be used for line-to-line distance calculations.

For point-to-point we could use functions from scipy.spatial, but since we already have shapely as dependency, this might not be needed. Most likely the scipy functions will be much faster since they are limited to one specific use case, point-to-point distance calculation, whereas shapely works with all its supported geometries.

Hence, currently I would prefer using shapely for line-to-line and line-to-point distance calculation and probably scipy for point-to-point.

There could/should (?) be a general interface to do these calculations. Maybe a good solutions would be a. ABC class, but maybe the added complexity is not worth it and it is better to use separate functions, e.g. calc_line_to_point_distances(ds_lines, ds_points).

Describe alternatives you've considered There might be code for line-to-point distance calculations because several people (maybe from @eoydvin or @maxmargraf ?) have used that to find gauges close to CMLs. Hence, we can collect and discuss these implementations and see if we do not need shapely for the line-to-point distances. However, we will need shapely as a dependency if we stick with the fast and tested grid intersection code from pycomlink to be implemented in #5.

eoydvin commented 5 months ago

The provided code runs nicely on my computer and gives reasonable output as far as I can see.

I have some code that I use to find points near lines that do not need the shapely to run. The code also runs without projecting the coordinates. However, the code is not that intuitive to read and would need some upgrades to work exactly as intended.. I will have a look at it so that we can discuss it, but for now (and most likely in the future) I think the code you have provided here is better to build on.

One thing to discuss is whether the coordinates should be provided as latitude/longitude or as a in a projected coordinate in a reference grid?

cchwala commented 5 months ago

One thing to discuss is whether the coordinates should be provided as latitude/longitude or as a in a projected coordinate in a reference grid?

That is a good point. With shapely or scipy.spatial we will rely on their built-in distance metrics, which, as far as I saw when doing a bit of a search, does not include something that handles lon-lat degree coordinates. If I recall correctly one can supply custom distance metrics via Python code, at least in scipy, but this results in significantly slower calculations.

Because of that, I would suggest to enforce projecting to length-preserving (or something that comes close) local coordinates. We could also enforce/suggest tot store these as x and y variables or for CML as e.g. site_0_x and site_0_y because, anyway, almost all maps make more sense in projected coordinates.

The user should of course be able to choose which projection to use, but we could provide some sane defaults. Naming and storing the projected coordinates needs to be synchronized with the WG1 data format conventions.

We could also add the functionality that the distance calculations always project the data before doing the actual calculation. We have to be careful to not hide too much magic with our functions.

Maybe something like this could be a compromise:

def get_gauge_distance_per_cml(ds_cmls, ds_gauges, project_coordinates_to=None):
    # Check if ds_cmls.site_0_x etc... and ds_gauges.x and y exist
        # If not check if project_coordinates_to is a valid EPSG string
             # If not, raise Error
             # If yes, project data
        # If yes, proceed

    # calculate distance using x and y cooridnates

But then we have to handle the case where one of ds_cmls and ds_gauges has projected coordinates and the other does not, etc...

Maybe it is easier to do something like

ds_cmls = add_projected_coordinates_from_lon_lat(ds_cmls, projection=`some_EPSG_string`)

get_gauge_distance_per_cml(...) # This raises an error if x and y are not there

We would need to check if add_projected_coordinates_from_lon_lat mutates ds_cmls or if we can make it only return an updated version. I prefer to not mutate inputs of a function, i.e. only having pure functions, so that ds_cmls_projected = add_projected....(ds_cmls, ...) will not alter ds_cmls. We could also return a tuple of the four CML coordinates, but that is not nice for the user who has to remember all the correct names of the 4 CML coordinates.

cchwala commented 4 months ago

Note that scipy.spatial.distance.cdist seems to be faster than scipy.spatial.distance_matrix, which we currently use. From a quick look and quick test, they provide exactly the same result.

See:

For very large number of sensors, above 10,000, the distance matrix calculation gets pretty slow (something like 30 sec for 10,000 sensors against themselfs, if I recall correctly) and this scales quadratically (I guess...). Hence we might also want to provide the option to calculat nearest neighbors with scipy.spatial.KDTree, since we typically only use sensors within a certain distance (max. some tens of kilometers) which means that most parts of the distance matrix is not used if our domain is large e.g. for country-wide data in Germany.

maxmargraf commented 4 months ago

There are several ways to calculate the distance between points and points and lines available. I tested some of them and here are the findings. I tested the distance calculation between the following DWD stations available here.

Abenberg_4 10.9667 49.2346
Großenkneten_4 8.237 52.9336

453.708 km - distance in google earth
453.710 km - distance from the pyproj.Geod class described here. This approach digests lon/lat coords and is applicable for point-to-point and point-to-line. Compared to the others it is not very fast.
452.989 km - distance using the Haversine formula implementation in pycomlink. Uses lon/lat coordinates, 1000x faster then pyproj.Geod
453.558 km - distance from shapley.Linestring length. Requires coordinates in meters e.g. UTM32. Calculates the Euclidean distance ignoring the earth's curvature. Fast.
scipy cdist: not yet tested.

From this first testing I think pyproj.Geod would be the most precise, does not require transformation from WGS84 and applies to point-to-point and point-to-line calculations (and probably point/line to polygon) but it takes ~200ms for one pair.

cchwala commented 4 months ago

Thanks @maxmargraf for doing this comparison and for the summary.

Even though pyproj.Geod is most accurate, it seems to be too slow (200ms for one pair) to use it for a large set of sensors. The Harvesine formula might lead to similar results, depending on radius of the earth that is used (which assumes a spherical earth).

In general, I think, our main limitation is that we need shapely (or at least IMO we should prefer it over own implementations) for doing the calculation for line-to-line and line-to-point distances. And since, as far as I see, there is currently no way in shapely to use something else than the euclidian distance (i.e. do correct lon-lat distance calculation on a the earth surface), we have to project the coordinates first before doing distance calculations. The only problem with this would be, if a dataset spans a domain that is too large for meaningful distance-preserving projections.

Or did you test something else than shapely for point-to-line distances?

maxmargraf commented 4 months ago

Or did you test something else than shapely for point-to-line distances?

No, please note thatpyproj.Geod relies on shapely objects

cchwala commented 4 months ago

Okay. So for point-to-line or line-to-line we have the two options

to use pyproj.Geod (which is slow) with shapely objects with lon-lat or
use shapely directly with coordinates in a distance-preserving projection (which is fast, maybe also the projection itself, since it is not done per pair but per coordinate)

If so, I prefer 2.

cchwala commented 3 months ago

Based on discussion on site, it might more important to focus on functions that get the nearest neighbors instead of a full distance matrix (even if it is a sparse distance matrix, see #37).

OpenSenseAction / poligrain

Add method for finding neighoring points and lines #15