abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License
341 stars 23 forks

convolve: unexpected performance profile #135

Closed · azmyrajab closed 1 month ago

azmyrajab commented 5 months ago

Thanks for this useful package!

I was making use of your convolve expression function (in particular mode=LEFT). I noticed that polars-ds convolve is slower than numpy for some kernel sizes, and then, interestingly, much faster at larger kernel sizes. I thought the unexpected performance profile might indicate there is something relatively easy that can be done to improve performance for smaller kernels (which tend to be the common case).

This is some sample code which documents the performance profile vs numpy on my machine:

import numpy as np
import plotly.express as px
import polars as pl
import polars_ds as pds

from contextlib import contextmanager
from time import perf_counter

# Minimal stand-in for the timing helper used below (assumption: the yielded
# callable returns the elapsed wall-clock time of the block, in seconds).
@contextmanager
def timer(label):
    start = perf_counter()
    elapsed = []
    yield lambda: elapsed[0]
    elapsed.append(perf_counter() - start)

df = pl.DataFrame({"x": np.random.normal(size=100_000)}).with_row_index().with_columns(pl.lit(1).alias("const"))

warm_up_run = 1
results_numpy = []
results_pds = []
kernel_sizes = (2, 3, 5, 10, 50, 100, 300, 500)

for i, k in enumerate(kernel_sizes):
    kernel = np.ones(k)

    if i >= warm_up_run:
        with timer("numpy") as t_numpy:
            convolve_numpy = np.convolve(df["x"], kernel, mode="full")[:len(df)]
        results_numpy.append(t_numpy())

        with timer("polars ds") as t_pds:
            convolve_pds = df.select(pds.convolve("x", kernel, mode="left"))
        results_pds.append(t_pds())

        assert np.allclose(convolve_numpy, convolve_pds.to_numpy().flatten())

pl.DataFrame({"numpy": results_numpy,
              "polars-ds": results_pds,
              "kernel_sizes": kernel_sizes[warm_up_run:]}
             ).pipe(px.line, x="kernel_sizes", y=["numpy", "polars-ds"], markers=True)

Notice that your package's convolve runtime decreases with kernel size:

[plot: runtime vs kernel size for small kernels, numpy vs polars-ds]

And the runtime does eventually start growing again, as we'd expect, since it should converge to roughly k log(k) scaling:

[plot: runtime vs kernel size for larger kernels, numpy vs polars-ds]

And by the way, for these larger sizes your implementation becomes a lot faster than numpy, which is great.

I noticed that you are using realfft, which itself uses rustfft, which looks like a decent implementation. From a very quick look, I wonder if this could be related to how the FFT planner chooses an algorithm based on the input size? Otherwise, could it be related to the allocation of arrays, padding, and so on?
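For reference, the typical realfft call pattern looks roughly like the sketch below (my guess at the shape of it, not necessarily how polars-ds invokes it). Planning and buffer allocation are per-call fixed costs, which is exactly the kind of overhead that would matter most for small inputs:

use realfft::num_complex::Complex;
use realfft::RealFftPlanner;

// Sketch only: plan, allocate buffers, and run a forward real-to-complex FFT.
fn forward_fft(signal: &[f64]) -> Vec<Complex<f64>> {
    let mut planner = RealFftPlanner::<f64>::new();
    let r2c = planner.plan_fft_forward(signal.len()); // algorithm chosen per length
    let mut input = signal.to_vec();                  // working copy; process mutates the input
    let mut spectrum = r2c.make_output_vec();         // length = signal.len() / 2 + 1
    r2c.process(&mut input, &mut spectrum).unwrap();
    spectrum
}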

thank you in advance

abstractqqq commented 5 months ago

Thank you for this issue! Indeed, I have noticed this myself (for FFT), and here is what I think is happening; the same optimizations might be applicable to convolve as well...

I think NumPy/SciPy's FFT might have "small size optimizations", which (according to my understanding) unroll for-loops (i.e. write the procedure without loops), use compile-time fixed-size arrays for stack allocation, etc. As the name suggests, such techniques are great for small sizes.

For RustFFT, I don't think it implements these optimizations, as they are very specialized and only useful when N is small. When N is small, the overhead of the planner might also be more apparent... That said, I am not sure why a small kernel size would be slower... But when things are small, the timings might be a little off or have large variance. Not sure...

I do care about performance a lot, but FFT/convolution is such a big subject that I feel if I keep digging down this rabbit hole (e.g. implementing the small size optimizations) I won't be able to make progress on the rest of the package 😂.. I am still very glad to see you bring this up.

I will keep this open and see what I can do to improve performance.

azmyrajab commented 5 months ago

Thanks for your reply, and I totally get that FFT convolution is one tiny corner of what this package is trying to do. No point wasting a lot of time here absent a relatively quick win, imo.

For what it's worth, the timings on randomly generated data are very consistent on my machine: they almost deterministically produce the graph below.

[plot: runtime vs kernel size, numpy vs polars-ds, consistent across runs]

So I'm not sure it's the lack of small size optimizations, because if that were purely the culprit we should not see a downward-sloping runtime at any point with polars-ds. Things getting reliably faster for larger kernels indicates, to me, some form of switching behaviour happening in the Rust implementation.

It's as if padding with zeros, up to some degree, is reliably faster for me :)

[plot: runtime vs kernel size, showing the drop at larger kernel sizes]
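If it really is size-dependent algorithm selection, one common workaround is to zero-pad the FFT length up to the next power of two; just a sketch of the idea, not something I've tried against polars-ds:

fn fft_len_for_linear_conv(data_len: usize, kernel_len: usize) -> usize {
    // Minimum length that avoids circular wrap-around for a full linear convolution.
    let min_len = data_len + kernel_len - 1;
    // Pad up to the next power of two, which tends to hit rustfft's fastest code paths.
    min_len.next_power_of_two()
}
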
azmyrajab commented 5 months ago

Going through the rustfft docs -

The recommended way to use RustFFT is to create a FftPlanner instance and then call its plan_fft method. This method will automatically choose which FFT algorithms are best for a given size and initialize the required buffers and precomputed data.

// Perform a forward FFT of size 1234
use rustfft::{FftPlanner, num_complex::Complex};

let mut planner = FftPlanner::new();
let fft = planner.plan_fft_forward(1234);

let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234];
fft.process(&mut buffer);

The planner returns trait objects of the Fft trait, allowing for FFT sizes that aren’t known until runtime.

RustFFT also exposes individual FFT algorithms. For example, if you know beforehand that you need a power-of-two FFT, you can avoid the overhead of the planner and trait object by directly creating instances of the Radix4 algorithm:

If you do get the bandwidth at some point, it might be worth constructing [Radix4](https://docs.rs/rustfft/latest/rustfft/algorithm/struct.Radix4.html) directly instead of letting the generic planner be too clever, and checking that performance is faster than (or comparable to) numpy at all power-of-two kernels (e.g. test 2, 4, 8, 16, 256). That would tell us whether the problem is indeed in the (ineffective) small size optimization and whether this is a way to side-step it.
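Something along these lines, adapted from the rustfft docs (untested on my end):

use rustfft::{algorithm::Radix4, num_complex::Complex, Fft, FftDirection};

// Bypass the planner and construct the Radix4 algorithm directly
// (only valid for power-of-two FFT lengths).
let len = 1024;
let fft = Radix4::new(len, FftDirection::Forward);

let mut buffer = vec![Complex{ re: 0.0f64, im: 0.0f64 }; len];
fft.process(&mut buffer); // in-place forward FFT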

abstractqqq commented 5 months ago

I will spend some time digesting your comments and see what I can do.. This is a wealth of great information. Thanks a lot

abstractqqq commented 5 months ago

As some partial progress: I realized that for small kernel sizes, doing the convolution by hand is faster. E.g. on my local machine, running convolution on data of size 100_000 produced the following results:

%timeit df.select(pds.convolve("x", kernel, mode="left"))
%timeit df.select(pds.convolve2("x", kernel, mode="left"))
# kernel size 100
# 6.21 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 4.93 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# kernel size 20
# 7.73 ms ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1.08 ms ± 6.45 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# The speedup is more significant for smaller kernel sizes

Would it work for your use case if I simply set up two paths for convolution depending on kernel size? Or I can provide two versions of convolution and let the user decide which one to use for their use case.

azmyrajab commented 5 months ago

Hi @abstractqqq - thanks for looking into this. I think either of those proposed solutions makes sense to me - thanks! Slight personal preference for making the switch for the user, since it's one less knob to expose, but either way sounds great.

If you opt for the switch, we could make it at approximately the kernel size (for your 100k test sample) where pds.convolve becomes faster than pds.convolve2 - maybe in the region of ~300, I'd guess.

For your direct small kernel convolution function, if it's not too much of a hassle, you could try some "loop unrolling" - basically summing e.g. a batch of 5 at a time, as in the sketch below (I'm hoping that could add a bit of additional performance to further bridge the gap with scipy).
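Purely as an illustration of the idea (a hypothetical helper, not polars-ds code; I've used batches of 4 here to keep the tail handling simple):

// Manually unrolled dot product for one sliding window (assumes
// window.len() == kernel.len()); independent accumulators give the
// compiler more room to vectorise / pipeline.
fn dot_unrolled(window: &[f64], kernel: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let main = window.len() / 4 * 4;
    for i in (0..main).step_by(4) {
        acc[0] += window[i] * kernel[i];
        acc[1] += window[i + 1] * kernel[i + 1];
        acc[2] += window[i + 2] * kernel[i + 2];
        acc[3] += window[i + 3] * kernel[i + 3];
    }
    let mut sum: f64 = acc.iter().sum();
    // Handle the leftover elements that don't fill a full batch.
    for i in main..window.len() {
        sum += window[i] * kernel[i];
    }
    sum
}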

In any case thank you for working on this!

abstractqqq commented 5 months ago

v0.4.3 will be published soon. The current patch gives the user the option to choose a method, either direct or fft. Some small perf improvement was also made by passing the dot product computation to ndarray, which has better codegen for such things. I am going to keep this thread open in case more people want to jump in and share some ideas.
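For reference, the rough shape of the dispatch looks something like the sketch below (simplified; the actual function signature in the package differs, edge handling is skipped, and the FFT branch is elided):

use ndarray::ArrayView1;

enum ConvMethod {
    Fft,
    Direct,
}

// Simplified sketch of the two-path dispatch.
fn convolve_sketch(input: &[f64], kernel: &[f64], method: ConvMethod) -> Vec<f64> {
    let k = ArrayView1::from(kernel);
    match method {
        // Direct path: one ndarray dot product per sliding window.
        ConvMethod::Direct => input
            .windows(kernel.len())
            .map(|w| k.dot(&ArrayView1::from(w)))
            .collect(),
        // FFT path: pad, forward FFT both inputs, multiply pointwise, inverse FFT.
        ConvMethod::Fft => unimplemented!(),
    }
}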

azmyrajab commented 5 months ago

The way you implemented ConvMethod {FFT, DIRECT} looks pretty clean to me.

As a last idea from my side - would it be possible to add a "PARALLEL" option as well? That would use rayon (which I think your project already has as a dependency) to "cheat" and compute the dot products in parallel. It should, I think, provide a good option when compute is abundant and the kernel size is relatively small.

Obviously, if the user is calling this in a .over(...) context, the nested parallelism could hurt performance, so they'd need to choose this option carefully.

            // Compute each sliding-window dot product in parallel with rayon.
            let kernel = ArrayView1::from(filter);
            Ok(
                input
                .par_windows(filter.len()) // if parallel; else .windows() for the direct path
                .map(|sl| {
                    let slice = ArrayView1::from(sl);
                    kernel.dot(&slice)
                })
                .collect()
            )
abstractqqq commented 5 months ago

Sure. And yes, your point about running this in .over / group by is exactly correct.

I will add it for the direct method. Potentially you could also run the FFT in parallel, but that requires way more engineering, so I will put that off for next time. I still think that users should, if possible, prioritize running multiple convolutions in parallel instead of running one convolution in parallel. But again, there are many cases that call for a faster single computation.

abstractqqq commented 5 months ago

500k data, 1000 kernel size, one expression in select statement:
direct: 68.7 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
direct + parallel: 13.7 ms ± 51.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
FFT: 35.7 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
SciPy: 20.2 ms ± 67.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

500k data, 100 kernel size, one expression in select statement:
direct: 9.2 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
direct + parallel: 5.52 ms ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
FFT: 83.5 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
SciPy: 7.08 ms ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Will be in the next release, which won't happen till at least the end of this week.

FFT really scales strangely but this was already pointed out in previous comments.