bmcfee / resampy

Efficient sample rate conversion in Python
https://resampy.readthedocs.io
ISC License

Resampy 0.3.0 is slow for multi-dimensional input #102

Closed svankuyk closed 2 years ago

svankuyk commented 2 years ago

Compared to 0.2.2, resampling in 0.3.0 has regressed for multi-dimensional input.

In 0.3.0, this is fast:

```python
x = np.random.rand(48000)
y = resampy.resample(x, 48000, 24000, axis=0, parallel=False)
```

whereas changing the input to

```python
x = np.random.rand(48000, 1)
```

is extremely slow. Even with parallel=True, the performance for multi-dimensional input is worse than in 0.2.2.

This helps a lot:

```python
y = resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
```

i.e., it's better to loop over the channel dimension than to pass 2-dimensional inputs directly, as in the sketch below.
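For reference, a minimal sketch of the per-channel workaround (the stereo shape and the stacking are illustrative, not from the report):

```python
import numpy as np
import resampy

x = np.random.rand(48000, 2)  # e.g. stereo audio, channels on axis 1

# Resample each channel separately, then stack the results back together
y = np.stack(
    [resampy.resample(x[:, c], 48000, 24000, axis=0) for c in range(x.shape[1])],
    axis=1,
)
```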

In summary, the new changes that enable parallel processing are great for 1D inputs but regress significantly for 2D inputs (which are common for multi-channel audio). Even with parallel processing disabled, 0.3.0 shows a regression for 2D inputs.

Tested on a MacBook with Python 3.7.

bmcfee commented 2 years ago

Thanks for raising this issue. I'll admit to being pretty confused by this!

Just testing out your setup, I'm seeing the following:

```
In [13]: %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
22.9 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
5.59 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [15]: %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
730 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
277 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

So indeed, the multidimensional case is slower, but there's still a speedup from parallel=False to parallel=True as we should expect. I don't think parallelism is the source of the problem.

There were a couple of other changes from 0.2.2 to 0.3 that could be relevant here, notably #90. In the example above, there are no substantive memory layout changes that should affect anything here, and the swapaxes calls are bypassed (since we're resampling axis 0).

The only thing that I can think of that might have caused this regression is a simplification of the interpolation loop code from explicitly iterating over channels: https://github.com/bmcfee/resampy/blob/ccb85575403663e17697bcde8e25765b022a9e0f/resampy/interpn.py#L41-L47

to implicitly doing so: https://github.com/bmcfee/resampy/blob/f8d4bb93d99a3edc2be972a07a9055ac4248f15a/resampy/interpn.py#L39-L44

This was done to remove the need to flatten the input/output arrays into 2D, which was causing some issues in higher-dimensional interpolation #73. It might be that numba is having a harder time optimizing this now since the number of dimensions is variable at runtime, and the trailing dimensions are syntactically hidden.
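A toy illustration of the two loop styles (a schematic sketch, not resampy's kernel; the function names and the doubling operation are made up for illustration):

```python
import numba
import numpy as np

@numba.jit(nopython=True)
def scale_explicit(x, y):
    # 0.2.2-style: explicit scalar loop over the trailing (channel)
    # dimension, so numba sees only fixed-arity scalar indexing
    for t in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[t, j] += 2.0 * x[t, j]

@numba.jit(nopython=True)
def scale_implicit(x, y):
    # 0.3.0-style: y[t] += ... is an array expression over a slice,
    # with the trailing dimensions handled implicitly
    for t in range(y.shape[0]):
        y[t] += 2.0 * x[t]

x = np.random.rand(48000, 1)
y = np.zeros_like(x)
scale_explicit(x, y)  # both compile on first call; compare with %timeit
scale_implicit(x, y)
```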

bmcfee commented 2 years ago

Ok, after digging into this a bit, I think the recommended strategy here would be to rewrite the interpolator as a generalized ufunc using @guvectorize. This would essentially lift the interpolation loop out to run in parallel over all trailing dimensions (I think) and ought to alleviate the issue without having to jump through more hoops.

I'm not sure how this will interact with the parallel/serial implementation though. We can use target='parallel' instead of target='cpu', but I'm not sure it will have the same effect overall. Something to experiment with anyway.
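For context, a minimal guvectorize sketch (again not resampy's kernel; `cumsum_1d` is a made-up example) showing how a gufunc's core signature lifts the loop over extra dimensions out of the function body:

```python
import numpy as np
from numba import guvectorize, float64

# The core signature "(n)->(n)" means the body only ever sees 1-d slices;
# numpy's gufunc machinery broadcasts over all remaining dimensions.
@guvectorize([(float64[:], float64[:])], "(n)->(n)")
def cumsum_1d(x, out):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
        out[i] = total

x = np.random.rand(2, 48000)  # two "channels", core axis last
y = cumsum_1d(x)              # the channel loop happens in the gufunc machinery
assert y.shape == (2, 48000)
```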

bmcfee commented 2 years ago

Took a quick stab at guvectorize here; it did not pay off, but maybe it could with a more careful implementation.

bmcfee commented 2 years ago

Ok, I've got this back down to parity with serial single-threaded execution.

`@guvectorize` implementation:

```python
import numba
from numba import float64


def _resample_f(x, t_out, interp_win, interp_delta, num_table, scale, y):
    index_step = int(scale * num_table)
    time_register = 0.0

    n = 0
    frac = 0.0
    index_frac = 0.0
    offset = 0
    eta = 0.0
    weight = 0.0

    nwin = interp_win.shape[0]
    n_orig = x.shape[0]
    n_out = t_out.shape[0]

    for t in numba.prange(n_out):
        time_register = t_out[t]

        # Grab the top bits as an index to the input buffer
        n = int(time_register)

        # Grab the fractional component of the time index
        frac = scale * (time_register - n)

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the left wing of the filter response
        i_max = min(n + 1, (nwin - offset) // index_step)
        for i in range(i_max):
            weight = (interp_win[offset + i * index_step]
                      + eta * interp_delta[offset + i * index_step])
            y[t] += weight * x[n - i]

        # Invert P
        frac = scale - frac

        # Offset into the filter
        index_frac = frac * num_table
        offset = int(index_frac)

        # Interpolation factor
        eta = index_frac - offset

        # Compute the right wing of the filter response
        k_max = min(n_orig - n - 1, (nwin - offset) // index_step)
        for k in range(k_max):
            weight = (interp_win[offset + k * index_step]
                      + eta * interp_delta[offset + k * index_step])
            y[t] += weight * x[n + k + 1]


resample_f_p = numba.guvectorize(
    [(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
    "(n),(m),(p),(p),(),()->(m)", target='parallel')(_resample_f)

resample_f_s = numba.guvectorize(
    [(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
    "(n),(m),(p),(p),(),()->(m)", target='cpu')(_resample_f)
```

Benchmark results below: `resample` is the guvectorize version, `resampy.resample` is 0.3.0.

```
>>> x = np.random.randn(48000, 1)
>>> y = resample(x, 48000, 24000, axis=0, parallel=True)
>>> y2 = resampy.resample(x, 48000, 24000, axis=0, parallel=True)
>>> np.allclose(y, y2)
True

>>> # Benchmark with and without parallelism
>>> %timeit resample(x, 48000, 24000, axis=0, parallel=False)
26.1 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resample(x, 48000, 24000, axis=0, parallel=True)
26.8 ms ± 923 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
744 ms ± 6.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
281 ms ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> # And compare to 1d slices
>>> %timeit resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
25.8 ms ± 721 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
25.7 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
23.5 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
7.28 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

So the good news is that we get back down to matching 1D serial performance. The bad news is that target='parallel' does not match the numba parallel=True option (last line). This makes sense in this context: there's no channel-level parallelism to exploit here (there's only one dummy channel), and guvectorize does not parallelize prange loops within the core function.

There might be a way to do this where we split the interpolator into an outer vectorized function and an inner parallel-jitted function, but at present I don't see an easy way to accomplish both with one function.

bmcfee commented 2 years ago

Ok, confirmed that the indirect vector-parallel wrapper does the job!

Skipping the gory details, the trick looks as follows:

```python
# _resample_loop is the interpolation loop from above (the body of _resample_f);
# it is jit-compiled twice, with and without numba's parallel option.

def _resample_f_p(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_p(x, t_out, interp_win, interp_delta, num_table, scale, y)

def _resample_f_s(x, t_out, interp_win, interp_delta, num_table, scale, y):
    _resample_loop_s(x, t_out, interp_win, interp_delta, num_table, scale, y)

_resample_loop_p = numba.jit(nopython=True, parallel=True)(_resample_loop)
_resample_loop_s = numba.jit(nopython=True, parallel=False)(_resample_loop)

resample_f_p = numba.guvectorize(
    [(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
    "(n),(m),(p),(p),(),()->(m)", nopython=True)(_resample_f_p)

resample_f_s = numba.guvectorize(
    [(float64[:], float64[:], float64[:], float64[:], float64, float64, float64[:])],
    "(n),(m),(p),(p),(),()->(m)", nopython=True)(_resample_f_s)
```

Runtimes on the previous example:

```
>>> %timeit resample(x, 48000, 24000, axis=0, parallel=False)
23.8 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resample(x, 48000, 24000, axis=0, parallel=True)
7.17 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=False)
733 ms ± 7.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x, 48000, 24000, axis=0, parallel=True)
279 ms ± 6.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=False)
21.1 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit resampy.resample(x[:, 0], 48000, 24000, axis=0, parallel=True)
6.55 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

I'm happy to call this one done. Will try to get it implemented properly and shipped as 0.3.1 by the end of the week.