JoerivanEngelen opened 2 months ago
I would've expected as much, essentially: `median` just wraps `np.nanpercentile`.
I even left a TODO there about better performance: https://github.com/Deltares/xugrid/blob/fd1938aa84e54e81f788a04be01962c663626ff9/xugrid/regrid/reduce.py#L133
We're numba-jitting, so it will call the numba implementation. If you follow that link, you can see in the numba source code that it allocates several numpy arrays (on the heap). During regridding, this results in a huge number of tiny heap allocations (one allocation per target index), which obviously has bad performance characteristics.
I've thought about pre-allocating a buffer and passing it to each function to use as a workspace instead. (In most cases, such a buffer is probably small enough to fit in a CPU cache.) I think the challenge here is that ideally we do the reduction in parallel (see the `numba.prange` in `make_regrid`), which requires one buffer per parallel process and makes the parallelization less straightforward.
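To illustrate the per-call allocation cost, here is a hypothetical micro-benchmark (not xugrid code): many tiny `np.nanpercentile` calls versus a single reusable workspace that is copied into and sorted in place.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
# Simulate one tiny reduction per target index, as in regridding.
chunks = [rng.random(8) for _ in range(1_000)]

def median_nanpercentile():
    # Each call allocates several temporary arrays internally.
    return [np.nanpercentile(c, 50) for c in chunks]

def median_workspace(buffer=np.empty(8)):
    # Reuse one workspace: copy in, then sort the view in place,
    # so no fresh array is allocated per chunk.
    out = []
    for c in chunks:
        n = c.size
        buffer[:n] = c
        buffer[:n].sort()
        out.append(0.5 * (buffer[(n - 1) // 2] + buffer[n // 2]))
    return out

print("nanpercentile:", timeit.timeit(median_nanpercentile, number=5))
print("workspace:    ", timeit.timeit(median_workspace, number=5))
```

Both return identical medians (there are no NaNs in this toy data); the difference is purely in allocation and call overhead.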
Actually, it looks like I'm mistaken. `make_regrid` uses a closure to insert the reduction function `f` here:
```python
def _regrid(source: FloatArray, A: MatrixCSR, size: int):
    n_extra = source.shape[0]
    out = np.full((n_extra, size), np.nan)
    for extra_index in numba.prange(n_extra):
        source_flat = source[extra_index]
        for target_index in range(A.n):
            slice = row_slice(A, target_index)
            indices = A.indices[slice]
            weights = A.data[slice]
            if len(indices) > 0:
                out[extra_index, target_index] = f(source_flat, indices, weights)
    return out
```
Numba parallelization occurs over the `n_extra` dimension (e.g. the flattened time and layer dimensions). The following would work fine, since the buffer is allocated per parallel process:
```python
    for extra_index in numba.prange(n_extra):
        source_flat = source[extra_index]
        buffer = np.empty(buffersize)
        for target_index in range(A.n):
            slice = row_slice(A, target_index)
            indices = A.indices[slice]
            weights = A.data[slice]
            if len(indices) > 0:
                out[extra_index, target_index] = f(source_flat, indices, weights, buffer)
    return out
```
Then we can provide some additional memory for each reduction function. This would also allow us to write the mode implementation in a better way, since it currently also allocates a copy.
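As a sketch, a buffer-aware median reduction might look like the following. The function name and signature are illustrative, not xugrid's actual API; the `njit` fallback is only there so the sketch also runs without numba installed.

```python
import numpy as np

try:
    import numba
    njit = numba.njit
except ImportError:  # plain-Python fallback for illustration
    njit = lambda f: f

@njit
def median(values, indices, weights, workspace):
    # Copy valid (non-NaN) source values into the caller-provided
    # workspace instead of allocating a fresh array per call.
    count = 0
    for i in indices:
        v = values[i]
        if not np.isnan(v):
            workspace[count] = v
            count += 1
    if count == 0:
        return np.nan
    # Sort only the filled part of the workspace, in place.
    workspace[:count].sort()
    if count % 2 == 1:
        return workspace[count // 2]
    return 0.5 * (workspace[count // 2 - 1] + workspace[count // 2])
```

The `weights` argument is unused here but kept so the signature matches the other reduction functions called through `f`.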
The trickiest part is deciding the size of the buffer.
Actually, deciding on the size of the buffer is probably also quite easy: in the sparse matrix, the largest number of non-zeros in any row suffices.
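Assuming the CSR matrix stores standard row pointers (the `indptr` name here is the usual CSR convention, not necessarily `MatrixCSR`'s attribute name), the buffer size is one line:

```python
import numpy as np

# Row pointers of a 4-row CSR matrix; row widths are the
# differences between consecutive pointers: 2, 3, 0, 4.
indptr = np.array([0, 2, 5, 5, 9])
buffersize = np.diff(indptr).max()
print(buffersize)  # -> 4
```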
In the iMOD Python test bench, I noticed that changing the method from "mode" to "median" for one specific variable (idomain) caused the tests to take more than 1 hour instead of 20 minutes. This could be due to some faulty agent on TeamCity, but I thought some simple performance tests would be interesting to conduct here. I noticed a TODO in the median method noting that this specific method might not be very performant, because np.nanpercentile is used instead of a custom implementation.
I modified the OverlapRegridder example to conduct some simple performance tests:
This printed:
Indeed, the `median` method is slower than the rest. Next comes `mode`, but I think that is somewhat within the margin of error.