Deltares / numba_celltree

Celltree data structure for searching for points, lines, boxes, and cells (convex polygons) in a two-dimensional unstructured mesh.
https://deltares.github.io/numba_celltree/
MIT License

Increase maximum tree depth #13

Closed Huite closed 3 months ago

Huite commented 6 months ago

The current max depth is set to 32. I picked this smallish value, but while working on some small but somewhat perverse data (an earcut triangulation from polygons), it creates out-of-bounds errors. Setting the depth to 128 fixes it.

Huite commented 3 months ago

Just checked the VTK CellTreeLocator, which uses a C++ std::stack; I guess that is heap allocated.

It would be worthwhile to compare with a numba list and do some benchmarking. The stack-allocated stack might be a premature optimization on my part.
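For reference, here is a minimal sketch of the two flavours under discussion, a fixed-size array used as a push/pop buffer versus a numba typed list; it is illustrative only and not numba_celltree's actual internals. The toy tree is encoded as left/right child arrays with -1 marking leaves.

import numba as nb
import numpy as np
from numba.typed import List

STACK_SIZE = 32  # fixed maximum depth; exceeding it writes out of bounds

@nb.njit
def count_leaves_fixed(left, right):
    # Fixed-size buffer: one allocation, pushes and pops by index.
    stack = np.empty(STACK_SIZE, dtype=np.int64)
    stack[0] = 0  # start at the root
    size = 1
    leaves = 0
    while size > 0:
        size -= 1
        node = stack[size]
        if left[node] == -1:  # leaf
            leaves += 1
        else:  # internal node: push both children
            stack[size] = left[node]
            stack[size + 1] = right[node]
            size += 2
    return leaves

@nb.njit
def count_leaves_list(left, right):
    # Typed list: grows as needed, but append/pop in a hot loop is slower.
    stack = List()
    stack.append(0)  # start at the root
    leaves = 0
    while len(stack) > 0:
        node = stack.pop()
        if left[node] == -1:
            leaves += 1
        else:
            stack.append(left[node])
            stack.append(right[node])
    return leaves

# Three-node tree: a root with two leaf children.
left = np.array([1, -1, -1], dtype=np.int64)
right = np.array([2, -1, -1], dtype=np.int64)
print(count_leaves_fixed(left, right), count_leaves_list(left, right))  # 2 2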

Generating a largish mesh, 2.6 million triangles:

import geopandas as gpd
import numpy as np
import shapely.geometry as sg

import pandamesh as pm

# A square domain with a square hole ("donut"), scaled up by a factor of 1e4.
outer_coords = np.array([(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]) * 1e4
inner_coords = np.array([(3.0, 3.0), (7.0, 3.0), (7.0, 7.0), (3.0, 7.0)]) * 1e4
line_coords = np.array([(2.0, 8.0), (8.0, 2.0)]) * 1e4
inner = sg.LinearRing(inner_coords)
outer = sg.LinearRing(outer_coords)
line = sg.LineString(line_coords)
donut = sg.Polygon(outer, holes=[inner])

gdf = gpd.GeoDataFrame(data={"cellsize": [100.0]}, geometry=[donut])

# Triangulate with a target cell size of 100.0; this yields ~2.6 million triangles.
mesher = pm.TriangleMesher(gdf)
vertices, triangles = mesher.generate()

# Dump the mesh to disk so the benchmark script below can reload it.
vertices.astype(np.float64).tofile("vertices.bin")
triangles.astype(np.int64).tofile("triangles.bin")

Benchmarking:

import numpy as np
import numba_celltree

# %%

triangles = np.fromfile("c:/src/pandamesh/triangles.bin", dtype=np.int64).reshape((-1, 3))
vertices = np.fromfile("c:/src/pandamesh/vertices.bin", dtype=np.float64).reshape((-1, 2))
# %%

# Build the tree; -1 is the fill value for unused nodes in the faces array.
tree = numba_celltree.CellTree2d(vertices, triangles, -1)

# %%

# Sample a million random points uniformly over the mesh bounding box.
xmin = vertices[:, 0].min()
ymin = vertices[:, 1].min()
xmax = vertices[:, 0].max()
ymax = vertices[:, 1].max()
points = np.random.rand(int(1e6), 2)
points[:, 0] = points[:, 0] * (xmax - xmin) + xmin
points[:, 1] = points[:, 1] * (ymax - ymin) + ymin
# %%

%timeit result = tree.locate_points(points)

Static stack-memory stack, size 32:

64.2 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
63.3 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
64 ms ± 738 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
63.3 ms ± 738 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67 ms ± 877 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Static stack, size 128 gives basically the same result.

A list-based stack (which allows us to just append and pop) is at least twice as slow:

260 ms ± 81.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
312 ms ± 9.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
135 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
317 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
127 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

No clue why the timings are so inconsistent.

An ordinary heap-memory stack, size 32:

68.9 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
67.2 ms ± 701 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67.1 ms ± 804 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
64.3 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
65 ms ± 265 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Heap memory, size 128:

67.6 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
67.2 ms ± 438 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67.7 ms ± 669 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67.1 ms ± 406 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
64.8 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It's still a pretty tiny allocation all things considered; it does allocate once per query, since the queries are parallelized and we don't want to share pre-allocated memory.
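For context, the pattern looks roughly like the sketch below (hypothetical names, not the library's actual code): the traversal stack is allocated inside the prange loop body, so each query iteration owns its buffer and threads never share one.

import numba as nb
import numpy as np

STACK_SIZE = 128  # maximum supported tree depth

@nb.njit(parallel=True)
def visit_counts(left, right, n_queries):
    # Toy parallel query loop over a tree given as left/right child arrays.
    counts = np.zeros(n_queries, dtype=np.int64)
    for i in nb.prange(n_queries):
        stack = np.empty(STACK_SIZE, dtype=np.int64)  # allocated per query
        stack[0] = 0  # root
        size = 1
        visited = 0
        while size > 0:
            size -= 1
            node = stack[size]
            visited += 1
            if left[node] != -1:  # internal node: push both children
                stack[size] = left[node]
                stack[size + 1] = right[node]
                size += 2
        counts[i] = visited
    return counts

Sharing one pre-allocated stack across iterations would be a data race under prange, which is why the small per-query allocation is accepted.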

I'll keep the stack-memory stack and bump up the stack size for now.

Huite commented 3 months ago

Alternatively, we could implement bounds checking; ideally, the stack would then be reallocated at a larger size when it overflows. That might have some performance ramifications, although the heap allocation timings above suggest they would be negligible. On the other hand, at least for reasonably normal meshes, which result in a somewhat balanced tree, the current stack size of 128 should readily support billions of cells.
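For the record, a growing stack could look something like this (an illustrative sketch, not an implemented API): the push checks the capacity, doubles the buffer on overflow, and hands the possibly-new array back to the caller.

import numba as nb
import numpy as np

@nb.njit(inline="always")
def push(stack, value, size):
    # Grow by doubling when full; callers must keep the returned array:
    #     stack, size = push(stack, value, size)
    if size >= stack.shape[0]:
        new_stack = np.empty(2 * stack.shape[0], dtype=np.int64)
        new_stack[: stack.shape[0]] = stack
        stack = new_stack
    stack[size] = value
    return stack, size + 1

@nb.njit
def count_nodes(left, right):
    stack = np.empty(4, dtype=np.int64)  # deliberately tiny initial capacity
    stack, size = push(stack, 0, 0)  # push the root
    visited = 0
    while size > 0:
        size -= 1
        node = stack[size]
        visited += 1
        if left[node] != -1:  # internal node: push both children
            stack, size = push(stack, left[node], size)
            stack, size = push(stack, right[node], size)
    return visited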

If someone runs into an issue in the future, they should enable bounds checking for numba. A resulting index error is then almost certainly due to the fixed stack size.
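For completeness, this is stock numba functionality rather than anything in numba_celltree: bounds checking can be enabled globally with the NUMBA_BOUNDSCHECK=1 environment variable (before the first compilation), or per function via the boundscheck option, after which an out-of-bounds write raises an IndexError instead of silently corrupting memory.

# Globally: NUMBA_BOUNDSCHECK=1 python your_script.py
import numba as nb

@nb.njit(boundscheck=True)  # per-function alternative
def push_checked(stack, value, size):
    stack[size] = value  # raises IndexError once size reaches the stack length
    return size + 1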