digitalfabrik / opendrift-leeway-webgui

OpenDrift Leeway simulations for Search and Rescue operations (experimental!)
https://leeway.tuerantuer.org/
Apache License 2.0
12 stars 3 forks

Only plot particle clusters for example with k-means #84

Open svenseeberg opened 9 months ago

svenseeberg commented 9 months ago

We can use k-means clustering to calculate k clusters so that no particle end position is further than 3 km from its cluster center. This can be used as a base for simple search patterns (best route through all cluster centers).

The cluster centers can be ordered so that a path through all centers is the shortest from the current position of the SAR vessel.
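Ordering the centers for the shortest path is essentially a travelling-salesman problem; a greedy nearest-neighbour heuristic is a simple approximation. A minimal sketch, assuming planar coordinates in metres (e.g. after a UTM transform) and a hypothetical `order_centers` helper:

```python
import numpy as np

def order_centers(start, centers):
    """Greedy nearest-neighbour ordering of cluster centers,
    beginning at the vessel position `start` (x/y in metres)."""
    remaining = list(range(len(centers)))
    route = []
    pos = np.asarray(start, dtype=float)
    while remaining:
        # pick the closest unvisited center and move there
        dists = [np.linalg.norm(np.asarray(centers[i]) - pos) for i in remaining]
        nxt = remaining.pop(int(np.argmin(dists)))
        route.append(nxt)
        pos = np.asarray(centers[nxt], dtype=float)
    return route

# vessel at the origin, three centers along a line
print(order_centers((0, 0), [(5000, 0), (1000, 0), (2000, 0)]))  # → [1, 2, 0]
```

Greedy ordering is not optimal in general, but for a handful of cluster centers it should be close enough for a first search pattern.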

The coordinates can be attached as KML or GPX track.
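For the GPX side, a track is just `trk`/`trkseg`/`trkpt` elements (per the GPX 1.1 schema). A minimal serialization sketch; the `creator` value and `centers_to_gpx` name are made up for illustration:

```python
def centers_to_gpx(centers):
    """Serialize ordered (lat, lon) cluster centers as a minimal GPX 1.1 track."""
    pts = "\n".join(
        f'      <trkpt lat="{lat:.6f}" lon="{lon:.6f}"></trkpt>'
        for lat, lon in centers
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<gpx version="1.1" creator="leeway" xmlns="http://www.topografix.com/GPX/1/1">\n'
        '  <trk>\n    <name>search pattern</name>\n    <trkseg>\n'
        f'{pts}\n'
        '    </trkseg>\n  </trk>\n</gpx>\n'
    )

print(centers_to_gpx([(54.1, 7.9), (54.2, 8.0)]))
```

A library like gpxpy could do the same with less hand-written XML.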

svenseeberg commented 9 months ago

The challenge here is to use a valid distance metric in the k-means function. scikit-learn, for example, can only use the Euclidean metric, which does not really work in our case: there is a significant difference in distance between 1° in longitude and 1° in latitude. That means we need a k-means implementation that can use a distance function for coordinates on a sphere. It seems that NLTK implements a k-means method that supports a custom distance function.
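A haversine distance in kilometres is the kind of callable such an implementation would take. A sketch of the metric itself (the NLTK hookup is shown only as an untested comment):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([a[0], a[1], b[0], b[1]])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

# one degree of latitude is ~111 km everywhere ...
print(round(haversine_km((0.0, 0.0), (1.0, 0.0)), 1))   # → 111.2
# ... while one degree of longitude shrinks with latitude
print(round(haversine_km((60.0, 0.0), (60.0, 1.0)), 1)) # → 55.6

# Plugging this into NLTK would look roughly like (untested sketch):
#   from nltk.cluster import KMeansClusterer
#   clusterer = KMeansClusterer(k, distance=haversine_km, repeats=5)
#   labels = clusterer.cluster(points, assign_clusters=True)
```

The two printed values illustrate exactly the 1°-longitude vs. 1°-latitude asymmetry described above.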

The alternative would be to transform the lon/lat coordinates into x/y/z coordinates and apply k-means on them: https://github.com/shezmic/Geodetic-To-Cartesian-and-Vice-Versa/blob/master/converter.py
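The x/y/z alternative can be sketched without the linked converter: projecting onto the unit sphere makes Euclidean (chord) distance monotone in great-circle distance, so plain k-means becomes usable. The function name here is made up for illustration:

```python
import numpy as np

def lonlat_to_xyz(lon, lat):
    """Project lon/lat (degrees) onto the unit sphere as x/y/z.
    Chord distance between the results grows monotonically with
    great-circle distance, so Euclidean k-means works on them."""
    lon, lat = np.radians(lon), np.radians(lat)
    return np.column_stack([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])

pts = lonlat_to_xyz(np.array([0.0, 90.0]), np.array([0.0, 0.0]))
print(pts)  # equator at 0°E → (1,0,0); equator at 90°E → (0,1,0)
```

One caveat: the cluster centers k-means returns then lie slightly inside the sphere and would need to be projected back before converting to lon/lat.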

julled commented 9 months ago

I guess how many k make sense always depends on the geometry; do you think there is a generic answer to this question? Another option is to use DBSCAN, which also estimates the number of clusters itself.

SteffenME commented 9 months ago

I also think that a density-based approach like DBSCAN is better and can be used with meaningful parameters, e.g. 2 km for the neighbourhood distance. If you have too many k, you miss the highest densities. Another option would be to calculate local maxima of a density map.
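To make the 2 km neighbourhood idea concrete, here is a minimal, pure-NumPy sketch of DBSCAN (O(n²), fine for a few thousand particles; in practice `sklearn.cluster.DBSCAN(eps=2000, min_samples=...)` on UTM coordinates would be the obvious choice):

```python
import numpy as np

def dbscan(X, eps=2000.0, min_pts=5):
    """Tiny DBSCAN sketch: X is (n, 2) in metres; returns labels where
    -1 = noise and 0..k-1 = clusters."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        # seed a new cluster only from an unvisited core point
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if len(neighbors[j]) >= min_pts:  # only core points expand further
                for m in neighbors[j]:
                    if labels[m] == -1:
                        labels[m] = cluster
                        stack.append(m)
        cluster += 1
    return labels

# two tight clusters ~14 km apart plus one far-away outlier
X = np.array([[0, 0], [100, 0], [0, 100], [50, 50], [100, 100],
              [10000, 10000], [10100, 10000], [10000, 10100],
              [10050, 10050], [10100, 10100],
              [50000, 50000]], dtype=float)
print(dbscan(X, eps=2000.0, min_pts=3))  # → [0 0 0 0 0 1 1 1 1 1 -1]
```

Unlike the k-search loop for k-means, the number of clusters (and the noise label) falls out of the density parameters directly.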

SteffenME commented 9 months ago

Transforming lat/lon to x/y is necessary for any approach. I'd suggest transforming to UTM like we do in satsearch whenever we calculate distances; the code is already there. Can be done with pyproj or geopandas.
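Picking the right UTM zone from the coordinates can be done with a one-liner; the helper name is made up, and the simplification ignores the Norway/Svalbard exception zones:

```python
def utm_epsg(lon, lat):
    """EPSG code of the UTM zone containing (lon, lat), degrees.
    32600 + zone for the northern hemisphere, 32700 + zone for the
    southern; ignores the Norway/Svalbard special zones."""
    zone = int((lon + 180) // 6) + 1
    return (32600 if lat >= 0 else 32700) + zone

print(utm_epsg(8.0, 54.0))  # → 32632 (zone 32N, German Bight)

# With pyproj this becomes roughly (untested sketch):
#   from pyproj import Transformer
#   t = Transformer.from_crs(4326, utm_epsg(lon, lat), always_xy=True)
#   x, y = t.transform(lon, lat)
```

Fixing the zone once per simulation area keeps all particle positions in one consistent metric frame.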


svenseeberg commented 9 months ago

https://ocefpaf.github.io/python4oceanographers/blog/2013/12/16/utm/

svenseeberg commented 7 months ago

A random interesting resource for k-means clustering with Python: https://domino.ai/blog/getting-started-with-k-means-clustering-in-python

julled commented 7 months ago

I am not a big fan of k-means clustering, as the user always needs to provide the parameter "k". I am more in favour of DBSCAN, as it estimates the number of cluster centers automatically.

SteffenME commented 7 months ago

Some clustering ideas with plots (cleaned up; `epsg` must be set to the UTM zone of the area of interest, and `lonlats` comes from the OpenDrift simulation output):

from sklearn.cluster import KMeans
import numpy as np
from pyproj import Transformer
import matplotlib.pyplot as plt

# epsg: UTM zone covering the simulation area, e.g. 32632 for the German Bight
transform_from_4326 = Transformer.from_crs(4326, epsg).transform

def distance(a, b):
    return np.hypot(a[0] - b[0], a[1] - b[1])

# project the particle end positions to metric UTM coordinates
end_lon, end_lat = lonlats[..., -1]
X = []
for lon, lat in zip(end_lon, end_lat):
    xx, yy = transform_from_4326(lat, lon)  # EPSG:4326 axis order is lat, lon
    X.append([xx, yy])
X = np.array(X)

# increase k until no particle is further than 2 km from its cluster center
clusters = 2
while True:
    clustering = KMeans(n_clusters=clusters, max_iter=1000).fit(X)
    labels = clustering.labels_
    cluster_centers = clustering.cluster_centers_
    # note: [[]] * clusters would create aliases of a single list
    distances = [[] for _ in range(clusters)]
    for i in range(clusters):
        for xx in X[labels == i]:
            distances[i].append(distance(cluster_centers[i], xx))
    if max(max(d) for d in distances if d) > 2000:
        clusters += 1
    else:
        break

# plot each cluster in a random color, center marker scaled by cluster size
for i in range(clusters):
    color = np.random.rand(3)
    plt.scatter(X[labels == i, 0], X[labels == i, 1], c=[color])
    s = max(100, len(labels[labels == i]) * 50)
    plt.scatter(cluster_centers[i][0], cluster_centers[i][1], s=s, c=[color])
plt.show()