Deltares / xugrid

Xarray and unstructured grids
https://deltares.github.io/xugrid/
MIT License

Performance of median method regridder worse than other methods #240

Open JoerivanEngelen opened 2 months ago

JoerivanEngelen commented 2 months ago

In the iMOD Python test bench, I noticed that changing the regridding method from "mode" to "median" for one specific variable (idomain) caused the tests to take more than 1 hour instead of 20 minutes. This could be due to a faulty agent on TeamCity, but I thought it would be interesting to run some simple performance tests here. I noticed a TODO in the median method noting that it might not be very performant, because it uses np.nanpercentile instead of a custom implementation.

I modified the OverlapRegridder example to run some simple performance tests:

import time

import numpy as np
from matplotlib.tri import Triangulation

import xugrid as xu


def create_grid(bounds, nx, ny):
    """Create a simple grid of triangles covering a rectangle."""
    xmin, ymin, xmax, ymax = bounds
    dx = (xmax - xmin) / nx
    dy = (ymax - ymin) / ny
    x = np.arange(xmin, xmax + dx, dx)
    y = np.arange(ymin, ymax + dy, dy)
    y, x = [a.ravel() for a in np.meshgrid(y, x, indexing="ij")]
    faces = Triangulation(x, y).triangles
    return xu.Ugrid2d(x, y, -1, faces)

uda = xu.data.elevation_nl().ugrid.sel(
    x=slice(125_000, 225_000), y=slice(440_000, 500_000)
)
grid = create_grid(uda.ugrid.total_bounds, nx=1000, ny=1000)

functions = [
    "mean",
    "mean",  # Repeat first element to hack around JIT performance hits
    "harmonic_mean",
    "geometric_mean",
    "sum",
    "minimum",
    "maximum",
    "mode",
    "median",
    "max_overlap",
]

for f in functions:
    t0 = time.time()
    regridder = xu.OverlapRegridder(source=uda, target=grid, method=f)
    result = regridder.regrid(uda)
    t1 = time.time()
    tn = t1 - t0
    print(f"{f}: {tn} seconds")

This printed:

mean: 4.471312522888184 seconds
mean: 0.17104887962341309 seconds
harmonic_mean: 0.717357873916626 seconds
geometric_mean: 0.8859498500823975 seconds
sum: 0.6982800960540771 seconds
minimum: 0.6448187828063965 seconds
maximum: 0.6634867191314697 seconds
mode: 1.0880956649780273 seconds
median: 9.672606706619263 seconds
max_overlap: 0.6971712112426758 seconds

Indeed, the median method is slower than the rest. Mode comes next, but I think that is still within the margin.

Huite commented 2 months ago

I would've expected as much, essentially: median just wraps numpy's nanpercentile.

I even left a TODO there about better performance: https://github.com/Deltares/xugrid/blob/fd1938aa84e54e81f788a04be01962c663626ff9/xugrid/regrid/reduce.py#L133

We're Numba-jitting, so it'll call the Numba implementation of nanpercentile. If you dig into the Numba source code, you can see that it allocates several numpy arrays (on the heap). During regridding, this results in a huge number of tiny heap allocations (one per target index), which obviously has bad performance characteristics.
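
For reference, this is roughly what such a reduction looks like (a sketch, not the actual code in xugrid/regrid/reduce.py; the reductions are called as f(source_flat, indices, weights) during regridding):

    import numpy as np

    def median(values, indices, weights):
        # Sketch only: weights are ignored here, and both the fancy indexing
        # and np.nanpercentile allocate temporary arrays on every single call.
        return np.nanpercentile(values[indices], 50)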

I've thought about pre-allocating a buffer and passing that to each function to use as a workspace instead. (In most cases, such a buffer is probably small enough to fit in a CPU cache.) I think the challenge is that ideally we do the reduction in parallel (see the numba prange in make_regrid), which requires one buffer per parallel process and makes the parallelization less straightforward.

Huite commented 2 months ago

Actually, it looks like I'm mistaken.

make_regrid uses a closure to insert the reduction function f here:

    def _regrid(source: FloatArray, A: MatrixCSR, size: int):
        n_extra = source.shape[0]
        out = np.full((n_extra, size), np.nan)
        for extra_index in numba.prange(n_extra):
            source_flat = source[extra_index]
            for target_index in range(A.n):
                slice = row_slice(A, target_index)
                indices = A.indices[slice]
                weights = A.data[slice]
                if len(indices) > 0:
                    out[extra_index, target_index] = f(source_flat, indices, weights)
        return out

Numba parallelization occurs over the n_extra dimension (e.g. the flattened time and layer dimensions).

The following would work fine, since the buffer is allocated inside the prange loop and therefore never shared between parallel iterations:

    def _regrid(source: FloatArray, A: MatrixCSR, size: int):
        n_extra = source.shape[0]
        out = np.full((n_extra, size), np.nan)
        for extra_index in numba.prange(n_extra):
            source_flat = source[extra_index]
            # One workspace per prange iteration, so parallel iterations
            # don't share it and nothing is allocated in the inner loop.
            buffer = np.empty(buffersize)
            for target_index in range(A.n):
                slice = row_slice(A, target_index)
                indices = A.indices[slice]
                weights = A.data[slice]
                if len(indices) > 0:
                    out[extra_index, target_index] = f(source_flat, indices, weights, buffer)
        return out

Then we can provide some additional memory for each reduction function. This would also allow us to write the mode implementation in a better way, since it currently also allocates a copy.
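
For illustration, a buffer-based median could then look roughly like this (just a sketch with a made-up name and NaN handling; the real reductions in xugrid/regrid/reduce.py are Numba-compiled):

    import numpy as np

    def median_with_buffer(values, indices, weights, buffer):
        # Gather the overlapping source values into the preallocated
        # workspace, skipping NaNs, so nothing is allocated per call.
        n = 0
        for i in indices:
            v = values[i]
            if not np.isnan(v):
                buffer[n] = v
                n += 1
        if n == 0:
            return np.nan
        # Sort only the filled part of the buffer, in place; np.partition
        # would be cheaper still.
        valid = buffer[:n]
        valid.sort()
        if n % 2 == 1:
            return valid[n // 2]
        return 0.5 * (valid[n // 2 - 1] + valid[n // 2])

Here buffer is the workspace allocated per prange iteration in the snippet above, so the inner target loop does no allocations at all.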

The trickiest part is deciding the size of the buffer.

Huite commented 2 months ago

Actually, deciding on the size of the buffer is probably also quite easy: the largest number of non-zeros in any row of the sparse matrix suffices.
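
For example, assuming the CSR arrays mirror scipy.sparse.csr_matrix (with indptr marking the row boundaries), that maximum is a one-liner:

    import numpy as np
    from scipy import sparse

    # Toy weight matrix; each row holds the overlaps of one target cell.
    A = sparse.csr_matrix(np.array([
        [0.5, 0.5, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.2, 0.3, 0.4, 0.1],
    ]))

    # indptr[i + 1] - indptr[i] is the number of non-zeros in row i, so the
    # largest difference is the worst-case number of source cells overlapping
    # a single target cell, i.e. the required buffer size.
    buffersize = int(np.diff(A.indptr).max())
    print(buffersize)  # 4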