Open cchwala opened 5 months ago
The provided code runs nicely on my computer and gives reasonable output as far as I can see.
I have some code that I use to find points near lines that do not need the shapely
to run. The code also runs without projecting the coordinates. However, the code is not that intuitive to read and would need some upgrades to work exactly as intended.. I will have a look at it so that we can discuss it, but for now (and most likely in the future) I think the code you have provided here is better to build on.
One thing to discuss is whether the coordinates should be provided as latitude/longitude or as a in a projected coordinate in a reference grid?
One thing to discuss is whether the coordinates should be provided as latitude/longitude or as a in a projected coordinate in a reference grid?
That is a good point. With shapely
or scipy.spatial
we will rely on their built-in distance metrics, which, as far as I saw when doing a bit of a search, does not include something that handles lon-lat degree coordinates. If I recall correctly one can supply custom distance metrics via Python code, at least in scipy
, but this results in significantly slower calculations.
Because of that, I would suggest to enforce projecting to length-preserving (or something that comes close) local coordinates. We could also enforce/suggest tot store these as x
and y
variables or for CML as e.g. site_0_x
and site_0_y
because, anyway, almost all maps make more sense in projected coordinates.
The user should of course be able to choose which projection to use, but we could provide some sane defaults. Naming and storing the projected coordinates needs to be synchronized with the WG1 data format conventions.
We could also add the functionality that the distance calculations always project the data before doing the actual calculation. We have to be careful to not hide too much magic with our functions.
Maybe something like this could be a compromise:
def get_gauge_distance_per_cml(ds_cmls, ds_gauges, project_coordinates_to=None):
# Check if ds_cmls.site_0_x etc... and ds_gauges.x and y exist
# If not check if project_coordinates_to is a valid EPSG string
# If not, raise Error
# If yes, project data
# If yes, proceed
# calculate distance using x and y cooridnates
But then we have to handle the case where one of ds_cmls
and ds_gauges
has projected coordinates and the other does not, etc...
Maybe it is easier to do something like
ds_cmls = add_projected_coordinates_from_lon_lat(ds_cmls, projection=`some_EPSG_string`)
get_gauge_distance_per_cml(...) # This raises an error if x and y are not there
We would need to check if add_projected_coordinates_from_lon_lat
mutates ds_cmls
or if we can make it only return an updated version. I prefer to not mutate inputs of a function, i.e. only having pure functions, so that ds_cmls_projected = add_projected....(ds_cmls, ...)
will not alter ds_cmls
. We could also return a tuple of the four CML coordinates, but that is not nice for the user who has to remember all the correct names of the 4 CML coordinates.
Note that scipy.spatial.distance.cdist
seems to be faster than scipy.spatial.distance_matrix
, which we currently use. From a quick look and quick test, they provide exactly the same result.
See:
For very large number of sensors, above 10,000, the distance matrix calculation gets pretty slow (something like 30 sec for 10,000 sensors against themselfs, if I recall correctly) and this scales quadratically (I guess...). Hence we might also want to provide the option to calculat nearest neighbors with scipy.spatial.KDTree
, since we typically only use sensors within a certain distance (max. some tens of kilometers) which means that most parts of the distance matrix is not used if our domain is large e.g. for country-wide data in Germany.
There are several ways to calculate the distance between points and points and lines available. I tested some of them and here are the findings. I tested the distance calculation between the following DWD stations available here.
Abenberg_4 10.9667 49.2346
Großenkneten_4 8.237 52.9336
pyproj.Geod
class described here. This approach digests lon/lat coords and is applicable for point-to-point and point-to-line. Compared to the others it is not very fast.pyproj.Geod
shapley.Linestring
length. Requires coordinates in meters e.g. UTM32. Calculates the Euclidean distance ignoring the earth's curvature. Fast.From this first testing I think pyproj.Geod
would be the most precise, does not require transformation from WGS84 and applies to point-to-point and point-to-line calculations (and probably point/line to polygon) but it takes ~200ms for one pair.
Thanks @maxmargraf for doing this comparison and for the summary.
Even though pyproj.Geod
is most accurate, it seems to be too slow (200ms for one pair) to use it for a large set of sensors. The Harvesine formula might lead to similar results, depending on radius of the earth that is used (which assumes a spherical earth).
In general, I think, our main limitation is that we need shapely
(or at least IMO we should prefer it over own implementations) for doing the calculation for line-to-line and line-to-point distances. And since, as far as I see, there is currently no way in shapely
to use something else than the euclidian distance (i.e. do correct lon-lat distance calculation on a the earth surface), we have to project the coordinates first before doing distance calculations. The only problem with this would be, if a dataset spans a domain that is too large for meaningful distance-preserving projections.
Or did you test something else than shapely
for point-to-line distances?
Or did you test something else than shapely for point-to-line distances?
No, please note thatpyproj.Geod
relies on shapely
objects
Okay. So for point-to-line or line-to-line we have the two options
pyproj.Geod
(which is slow) with shapely
objects with lon-lat or shapely
directly with coordinates in a distance-preserving projection (which is fast, maybe also the projection itself, since it is not done per pair but per coordinate)If so, I prefer 2.
Based on discussion on site, it might more important to focus on functions that get the nearest neighbors instead of a full distance matrix (even if it is a sparse distance matrix, see #37).
UPDATE of status of implementation:
Is your feature request related to a problem? Please describe. We need a unified way of finding neighboring sensors, both from points and lines, and a mixture of both.
Describe the solution you'd like There is an implementation for line-to-point neighbor lookup using
shapely
here in the prototype that I built some time ago. For line-to-line this is IMO the most solid implementation. It can also be used for line-to-line distance calculations.For point-to-point we could use functions from
scipy.spatial
, but since we already haveshapely
as dependency, this might not be needed. Most likely thescipy
functions will be much faster since they are limited to one specific use case, point-to-point distance calculation, whereasshapely
works with all its supported geometries.Hence, currently I would prefer using
shapely
for line-to-line and line-to-point distance calculation and probablyscipy
for point-to-point.There could/should (?) be a general interface to do these calculations. Maybe a good solutions would be a. ABC class, but maybe the added complexity is not worth it and it is better to use separate functions, e.g.
calc_line_to_point_distances(ds_lines, ds_points)
.Describe alternatives you've considered There might be code for line-to-point distance calculations because several people (maybe from @eoydvin or @maxmargraf ?) have used that to find gauges close to CMLs. Hence, we can collect and discuss these implementations and see if we do not need
shapely
for the line-to-point distances. However, we will needshapely
as a dependency if we stick with the fast and tested grid intersection code frompycomlink
to be implemented in #5.