Deltares / pandamesh

🐼 From geodataframe to mesh ▦
https://deltares.github.io/pandamesh/
MIT License
28 stars 9 forks source link

Shapely STRtree versus scipy KDTree for snapping #40

Closed Huite closed 2 months ago

Huite commented 2 months ago

I've currently used the scipy KDTree with a sparse distance matrix to get snapping candidates, but this obviously requires scipy.

It might be worthwhile to use the shapely STRtree instead; we just have to build the CSR form ourselves.

Huite commented 2 months ago

With some 1.3 million points:

def locate_scipy(geometry, max_distance):
    xy = shapely.get_coordinates(geometry)
    tree = KDTree(xy)
    return tree.sparse_distance_matrix(tree, max_distance=max_distance).tocsr()

def locate_shapely(geometry, max_distance):
    tree = shapely.STRtree(geometry)
    i, j = tree.query(geometry, distance=max_distance, predicate="dwithin")
    coords = shapely.get_coordinates(geometry)
    dxy = coords[i] - coords[j]
    distance = np.linalg.norm(dxy, axis=1)
    return i, j, distance

matrix = locate_scipy(geometry, 2.0)

i, j, distance = locate_shapely(geometry, 2.0)
coo = sparse.coo_matrix((distance, (i, j)))
csr = coo.tocsr()

assert np.array_equal(csr.indptr, matrix.indptr)
assert np.array_equal(csr.indices, matrix.indices)
assert np.allclose(csr.data, matrix.data)

Of course, this still requires scipy sparse. But we can easily construct the indptr ourselves:

def locate_shapely(geometry, max_distance):
    tree = shapely.STRtree(geometry)
    i, j = tree.query(geometry, distance=max_distance, predicate="dwithin")
    coords = shapely.get_coordinates(geometry)
    dxy = coords[i] - coords[j]
    distance = np.linalg.norm(dxy, axis=1)
    indptr = np.empty(len(geometry) + 1)
    indptr[0] = 0
    indptr[1:] = np.cumsum(np.bincount(i))
    return indptr, j, distance

%timeit matrix = locate_scipy(geometry, 2.0)
%timeit indptr, indices, distance = locate_shapely(geometry, 2.0)
1.81 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.13 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So it's only a factor 1.7 slower. That doesn't make scipy as a dependency worthwhile.

The scipy CSR matrix will have sorted indices, whereas the shapely constructed one won't, but for the snapping case, the order is arbitrary anyway.