JuliaDSP / DSP.jl

Filter design, periodograms, window functions, and other digital signal processing functionality
https://docs.juliadsp.org/stable/contents/

Implementation of `conv()` Uses Overlap and Save While Direct Method Is Faster #355

Open RoyiAvital opened 4 years ago

RoyiAvital commented 4 years ago

The current implementation uses the Overlap and Save method for linear convolution in many cases. While this method is optimized for streamed data, it is not optimal for cases where all of the data is already in memory.

See https://discourse.julialang.org/t/convolution-conv-with-same-size-output/38260/11:

(figure: benchmark plot from the linked Discourse thread comparing convolution run times)

I think it would be better to implement a direct loop with @simd and @inbounds. It will probably be faster than the current method for most cases. For longer kernels / signals it should use the frequency domain, as is implemented now.
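A minimal sketch of such a direct loop (illustrative only, not DSP.jl's code; the name conv_direct is hypothetical):

function conv_direct(u::AbstractVector{T}, v::AbstractVector{T}) where T<:Number
    nu, nv = length(u), length(v)
    out = zeros(T, nu + nv - 1)          # full linear-convolution length
    @inbounds for j in 1:nv
        vj = v[j]
        @simd for i in 1:nu              # inner loop vectorizes over the signal
            out[i + j - 1] += u[i] * vj
        end
    end
    return out
end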

ssfrr commented 4 years ago

See this similar plot (but for longer filters) that @galenlynch made, which compares regular FFT processing to overlap-save. I think the thing to do is to add another check for short convolutions and use the direct method for those.

@RoyiAvital do you have the code for that plot? I'm curious about the weird diagonal lines where the overlap-save is faster. Did you use nextfastfft for the FFT-based plots?

galenlynch commented 4 years ago

Much of this is a repeat of what I said on Discourse, but I think the tradeoff between direct and FFT or overlap-save convolution is not a question of whether everything is in memory or not, but rather of the kernel size. FFT convolution has an algorithmic advantage over direct convolution (O(N log N) vs. O(N^2)), so you should only choose direct convolution when the overhead outweighs the complexity savings. Those savings only increase with overlap-save compared to FFT convolution. In my testing, the break-even point between FFT convolution and direct convolution happens when the kernel is length 58. You are welcome to compare for yourself, as direct convolution is already implemented in DSP.jl in the td_filt function for 1d arrays, which is chosen inside of filt if the kernel is less than 59 elements long. It is already vectorized (i.e. benefits from @simd and @inbounds), and furthermore we have unrolled the inner loop for kernel lengths less than 59.
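For reference, a quick way to check that break-even point yourself (lengths illustrative; note that filt returns a same-length output rather than the full convolution, so this is only a rough comparison):

using DSP, BenchmarkTools

x = randn(100_000)       # signal
h = randn(58)            # kernel near the reported break-even length

@btime filt($h, $x)      # time-domain (direct) FIR path
@btime conv($x, $h)      # FFT / overlap-save path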

As I mentioned in Discourse, it might be easy to use the direct convolution filt code for some subsets of conv, since conv is more general than filt at the moment.

Could you please compare your times in Matlab to the DSP.jl times?

RoyiAvital commented 4 years ago

I think you misinterpreted my writing. I will try to summarize, but I don't think there is much point in continuing the debate, as the image I pasted above shows real-world results.

  1. Whenever the kernel is small enough for the L2 cache of the CPU, the favorable memory access pattern of the direct method dominates. For many uses of convolution this is the most common case: think about differentiation, Savitzky-Golay, etc. They are all pretty small convolution kernels, not to mention in the 2D world. So having a direct method to handle those is a must. On a modern CPU it will be the fastest up to ~400-500 samples (kernel length). Maybe on Ryzen 3xxx it will be more, as it has giant caches.

  2. When the kernel is too large (when I say kernel, I mean the kernel and the same length of data from the signal), the superior memory access pattern no longer dominates and the excessive flops of the direct method kill it. Yet in those cases, what will usually be faster than Overlap and Save is the frequency-domain approach: it is well threaded and uses fewer operations.

  3. Overlap and Save requires both DFTs and a lot of memory access (selection of indices, writing to the array, etc.). Its sweet spot is only when the kernel is too big for the direct method, yet the ratio of sizes between the signal and the kernel is so big that padding the kernel and applying a full-length FFT costs so much that Overlap and Save is better. Yet if both the kernel and the signal are huge, the N log(N) factor of the DFT becomes something like K N log(K N), where K is the ratio. When we get into the big-signals world, K is bounded; it can't be 1000 or something like that. The cases where it is so large are really edge cases.

I showed @ssfrr with my code that for M = 240,000 and N = 8,640,000 the frequency-domain approach is still faster.

I also ran my analysis on bigger signals to get a broader perspective:

(figure: benchmark results for larger signal and kernel lengths)

I also learned from a talk with @ssfrr that your implementation of frequency-domain convolution uses nextpow2() to set the length of the zero padding. This is also a big mistake: pad to the length of the convolution. MATLAB uses FFTW as well, so the results should match.
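A minimal sketch of what I mean (not DSP.jl's code; conv_fft is a hypothetical name), padding to the exact convolution length and exploiting the symmetry of real transforms:

using FFTW

function conv_fft(u::Vector{Float64}, v::Vector{Float64})
    n = length(u) + length(v) - 1          # exact linear-convolution length
    up = [u; zeros(n - length(u))]         # zero-pad, no power-of-2 rounding
    vp = [v; zeros(n - length(v))]
    return irfft(rfft(up) .* rfft(vp), n)  # real transforms exploit symmetry
end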

Again, I think you should look at the results I posted; they speak for themselves. The code is also available and you can use it for further analysis. The analysis was part of my answer: https://dsp.stackexchange.com/a/66947.

Anyhow, just my 2 cents.

galenlynch commented 4 years ago

I understand and agree with your point that direct convolution will be faster than Fourier methods for kernel sizes between 0 and some number. I don't know where that trade-off point is, and I think it's worth keeping in mind how much run time will actually be saved. That being said, I'm all for a direct convolution branch in conv, and if you have the time and energy to implement it, I strongly encourage you to do so. However, I do take some issue with the particulars of your plots above, and the surprising conclusions that you draw from them.

In those plots, you're comparing the runtime of Matlab's conv2 function against the runtime of two of your own Matlab functions. By doing so, you're hoping to compare the underlying algorithms without regard to the implementation details. I don't know the ins and outs of how you implemented those functions, but let's check your implementation against DSP.jl's.

To generate your test times, I ran your code on my computer with convShape = CONVOLUTION_SHAPE_FULL. To compare these times to the current DSP.jl conv, I used the median conv run time as calculated with BenchmarkTools.jl, with about 10,000 runs for each kernel size. For the sake of time, I only compared the last row of your results, where the signal length is 2,020. The first thing that jumps out at me is that these measurements look quite noisy: I don't really know how Matlab's timeit works, but perhaps it didn't use a sufficient number of runs.

(figure: algorithm_benchmarks, run times for a signal of length 2,020)

It's worth noting that DSP.jl's conv only uses overlap-save for the first 10 kernels (of length 4-292) with a signal of length 2,020; beyond that it decides to use FFT convolution due to its performance heuristics. Nonetheless, across this range you can see that DSP.jl's overlap-save algorithm performs similarly to Matlab's conv2, and outperforms your overlap-save algorithm many fold. DSP.jl's conv is only up to 60 μs slower than Matlab's conv2, and outperforms it for larger kernels.

However, as you already pointed out, 2,020 is a small number of datapoints for many signals, though it's perhaps a relevant size for images. Two seconds of audio data (sampled at 44.1 kS/s) is around 100,000 samples long, so let's see what happens in this regime. Here I'm showing results generated the same way, but for a signal of length 100,000 and kernels with lengths between 4 and 2,020. You can see that for almost all of these kernels, DSP.jl's conv outperforms Matlab's conv2, though for kernels smaller than 90 this is not true. At all kernel sizes, DSP.jl vastly outperforms your overlap-save, as well as your FFT method. Here, DSP.jl's conv is using overlap-save for all kernel sizes. While Matlab's conv2 does perform better for kernels less than 90 data points long, at worst DSP.jl takes 550 μs longer.

(figure: algorithm_benchmarks_large, run times for a signal of length 100,000)

With all of these tests, we can see that implementation does matter, and that the overlap-save algorithm seems to fare well, in agreement with many sources. Certainly, in some cases a direct convolution algorithm will be faster, though I think that calculating when the increased efficiency of overlap-save will not make up for the extra overhead will be difficult, since it depends on both the kernel size and the signal size. I am very hesitant to draw any conclusions about cache-friendliness or thread-friendliness of the underlying algorithms from your Matlab scripts, or indeed from DSP.jl's implementation. I don't know why you think it's better not to round up to the next fast FFT length (a product of small primes) prior to Fourier transformation, which is in contrast to FFTW's recommendations, but if you have some actual benchmarks using Julia code then a PR would be welcome.
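For what it's worth, the rounding in question is DSP.jl's nextfastfft, which returns the next length whose factors are all small primes (2, 3, 5, 7), per FFTW's guidance:

using DSP

nextfastfft(2^19 + 150)   # next 7-smooth length >= 524438, far below 2^20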

Once again, if you or others are motivated to implement a direct convolution method, that would be great. However, I haven't been motivated to chase those extra microseconds, given how hard I think it will be to make a direct convolution method that generalizes to arbitrary dimensions and array indexing.

ssfrr commented 4 years ago

@RoyiAvital I'm confused by something on your plots - I'd expect your first plot (up to length 2000) to be the top-left corner of your next plot (up to length 18000) but they're showing different results. The first shows that direct is faster up to kernel lengths of about 500, but the second shows direct being faster up to about 2000 - was there something else different about those two runs?

On Complexity Analysis

Say you have a length N signal and length M filter, where M <= N. The standard operation counts are direct: O(N M), FFT: O((N + M) log(N + M)), and overlap-save: roughly O(N log M).

For a fixed kernel length M, varying the signal length N

(columns of @RoyiAvital's diagram above)

- direct: O(N M), linear in N
- FFT: O((N + M) log(N + M)), superlinear in N
- overlap-save: O(N log M), linear in N

For fixed signal length N, varying the kernel length M

(rows of @RoyiAvital's diagram, and the case that @galenlynch plotted above)

- direct: O(N M), linear in M
- FFT: O((N + M) log(N + M)), nearly flat in M while M <= N
- overlap-save: O(N log M), logarithmic in M

Summary

So I think that all the complexity analysis can tell us is that for a fixed kernel length, there's some signal length where FFT is the worst choice, and everything else depends on the constant factors.

As I mentioned in my discourse post, different fields have different common use-cases, so it's probably not productive to think in terms of what sizes are "huge" or "small" or which applications are "rare" or "common", particularly when we already have the hard work of implementing all three methods done.

So if someone wants to take this on: so far we still haven't really answered the question of where these cutoffs are. Really we'd want to compare DSP._conv_kern_fft!, DSP.unsafe_conv_kern_os! and DSP.filt! on a variety of sizes (probably on a log scale, as in @galenlynch's plot in the linked PR). Even if the criterion depends on both N and M, once we have that plot we could probably just define some lines that slice up the space appropriately and use that as a heuristic.
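Something like the following could generate that grid (a sketch using the public conv entry point, since the internal functions' exact signatures may change between versions):

using DSP, BenchmarkTools

Ns = round.(Int, exp10.(range(2, 6; length = 9)))   # signal lengths, log-spaced
Ms = round.(Int, exp10.(range(0, 4; length = 9)))   # kernel lengths, log-spaced

times = Dict{Tuple{Int,Int},Float64}()
for N in Ns, M in Ms
    M <= N || continue
    x, h = randn(N), randn(M)
    times[(N, M)] = @belapsed conv($x, $h)          # minimum seconds per call
end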

RoyiAvital commented 4 years ago

@ssfrr, I think all present know the theory of the asymptotic performance of the 3 methods. In my opinion, as long as you don't know the constants of the asymptotic behavior, it doesn't give you much in practice, as you'd need to run everything in practice anyway, so what's the point? If your analysis covered cache size, the number of floating-point ports in the CPU, memory access patterns, and memory bandwidth and latency, then it would add something.

Regarding the cutoff point, I wouldn't do deep research on that. What I'd do is make sure your frequency-domain function DSP._conv_kern_fft!() doesn't use nextpow2() when the signal is longer than, I don't know, around 1024 samples. As far as I understood from @ssfrr it does. Again, while "FFT is optimized for powers of 2" is totally right, it doesn't mean it is faster to apply fft on 2^20 samples than on 2^19 + 150 samples. Since, as far as I understand, you do apply nextpow2(), your cutoff between the frequency domain and OVS isn't right either.

@galenlynch, regarding your comment about 2 [Sec] of audio being 100,000 samples: the question is what we want to do with this data. If we want to filter it and the kernel we use is also very long, I'd say that in most cases we should use multi-rate approaches. It is not good practice to apply an LPF with 1/32 of the BW, for instance, on a signal directly, as it means the filter kernel will be long. There are tricks to do it properly which keep the filters smaller and more effective.

Anyhow, I think you have all you need. What's needed is to convert the MATLAB code to Julia, and to make sure the Julia code is as efficient as it can be (MATLAB can't get as efficient as Julia):

  1. Don't use nextpow2() in the frequency-domain convolution.
  2. Use the symmetry property in the ifft() to make it faster.
  3. Add support for full, same and valid in conv() (see the sketch after this list).
  4. Add a direct implementation. (I think using filt() isn't good here. The convention in the field is that filter() returns the same length as the input signal, yet that is different from same: one acts like padding with zeros at the beginning, the other at the end.) It should be well optimized with SIMD operations. I don't think multi-threading is important here, as it will be used for small kernels. MATLAB does have MT in conv(), yet that is mainly because it only uses the direct method for all lengths.
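A minimal sketch of item 3, trimming the full result (conv_shape is a hypothetical name; the index conventions follow MATLAB's conv):

using DSP

function conv_shape(u::AbstractVector, v::AbstractVector; shape::Symbol = :full)
    n, m = length(u), length(v)
    full = conv(u, v)                         # length n + m - 1
    shape === :full  && return full
    shape === :same  && return full[m÷2 + 1 : m÷2 + n]   # central part, length n
    shape === :valid && return full[m:n]      # full overlap only; requires n >= m
    throw(ArgumentError("unknown shape $shape"))
end
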
galenlynch commented 4 years ago

EDIT: I agree with what @ssfrr said about complexity.

So I think it's a question of judging when the overhead of either overlap-save or FFT convolution is not worth it, and choosing the algorithm appropriately. This is already being done to choose between overlap-save and FFT convolution, and the benchmarks guiding this decision point were from two Julia functions. While we don't have a direct convolution branch yet, my testing with the pretty fast direct convolution that's part of the FIR filtering codebase (which has comparable performance to Matlab's conv2) suggests that for large signals with ~1 million samples, direct convolution is faster than overlap-save only with kernels of length less than 59, and that the difference in run times is on the order of tens to hundreds of microseconds. @ssfrr your suggestion of doing this on many scales makes sense.

I think I wasn't quite clear in my post with the benchmarks of DSP.jl and the Matlab comparisons made by @RoyiAvital: I don't buy the idea that your Matlab benchmarks tell us which algorithm to choose in which situation. First of all, your benchmarks compare a compiled function that is shipped with Matlab against two of your own functions written in the Matlab language, which is unfair. The direct convolution code in your benchmark uses Matlab's conv2, which is not written in Matlab, but is probably compiled C++. Furthermore, I don't know how you wrote your overlap-save or FFT convolution functions, but from what I can see it looks like you wrote a slow overlap-save function. Since your premise seems to be that, based on your Matlab testing, we are using the wrong algorithm and should change what we are doing, I disagree with you.

From what I can tell, direct convolution would indeed be faster for small filters, but only in the cases outlined above, and for marginal performance savings. In case you think it's a problem with our implementation: tdfilt, which calls this code, is as fast as, or around 1/4 the speed of, Matlab's conv2. @RoyiAvital It sounds like you might be familiar with the inner workings of conv2, and if it's multi-threaded then that would explain the difference.

Multirate processing is also implemented in DSP.jl... let's just choose the fastest algorithm for conv, guided by real-world Julia benchmarks. Comparing your own code written in Matlab against Matlab's C++ code to guide what happens in DSP.jl is really comparing apples and oranges.

We already have implementations of all three algorithms; I don't know why you think we should copy your code.

  1. What our code does is use the next product of either 2, 3, 5, or 7 to size the FFT when doing normal FFT convolutions. If you want to improve that, guided by a benchmark with our code, then you are welcome to.
  2. Our code also already does a symmetric ifft if the data is real.
  3. Your request for full, same and valid might be a good (if not totally unrelated) issue, but right now the code returns full and lets people clip it if they want to. Let's split that off into a separate issue, and keep in mind that we are not trying to copy Matlab.
  4. We already have a version of this, with SIMD support, in tdfilt. It just doesn't have quite the right zero padding at the end, which is a trivial thing to fix. It should really just be changed so that both filt and conv can use the same code. PR welcome.
RoyiAvital commented 4 years ago

@galenlynch, I have no idea why you wrote that I suggested you copy my code.

All I wrote is:

  1. There are cases where you choose OVS but direct is faster. You showed this in your testing as well. On my computer, in MATLAB, the move from direct to OVS should happen at around ~400 samples. It might be different across implementations and CPUs, and your tuning on your system won't hold for other systems, but you can still gain intuition from it. I agree it is better to find the cutoff length using Julia code and not my MATLAB code. Absolutely.
  2. I think the implementation is missing full, same and valid functionality.
  3. I think a flag to choose the method (direct, OVS, DFT) should be added, as sketched below. The default should be an auto mode which uses whatever heuristics the run-time analysis suggests. As long as you give the user the option, the significance of your tweaking and decisions isn't critical.
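A hypothetical sketch of such a flag (the keyword name and placeholder heuristic are illustrative, not DSP.jl's interface; conv_direct and conv_fft refer to the sketches earlier in this thread):

using DSP

function conv_with_alg(u, v; algorithm::Symbol = :auto)
    if algorithm === :auto
        algorithm = length(v) < 59 ? :direct : :ovs    # placeholder heuristic
    end
    algorithm === :direct && return conv_direct(u, v)  # direct loop
    algorithm === :ovs    && return conv(u, v)         # current OVS/FFT path
    algorithm === :fft    && return conv_fft(u, v)     # plain FFT convolution
    throw(ArgumentError("unknown algorithm $algorithm"))
end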

My MATLAB implementation of OVS is non-allocating and uses only FFT (implemented by FFTW) and indexing, two operations MATLAB is quite efficient at (not as efficient as Julia, but as efficient as MATLAB can be). Hence it gave us the intuition.

My conclusion from my MATLAB implementation, that the gains of OVS are negligible, might be different with a much more efficient OVS implementation in Julia. But I'd still bet that the cases where OVS is superior to both direct and DFT-based methods are the minority.

ssfrr commented 4 years ago

This thread is getting a little unfocused. As I understand it the central claim is that there's a regime where direct convolution is faster than FFT or overlap-save (OVS). I think we can all agree that's true, but someone needs to run the benchmarks to figure out where that threshold is. Improving the FFT and OVS implementations and API changes can be a separate issue (and I agree that there can be API improvements). I'm not sure if there's much more to discuss until we see those benchmarks, using the Julia implementations.

galenlynch commented 4 years ago

I agree with the central claim, and also that the API could be improved. Sorry if I got a bit aggressive in my response.

rafaqz commented 3 years ago

To add another data point to this discussion: I'm not across the technical details you are discussing here, but I was surprised that, in practice, the direct windowing in DynamicGrids.jl runs 3x3 StaticArray convolutions 1.5-3x faster than DSP.jl, and it wasn't even written to do convolutions!

It's hard to benchmark precisely because of differences in initial allocations and setup; if DSP.jl had a pre-allocated conv! method this would be more reliable. This is for 1000x1000 arrays, but the comparison increasingly favours DynamicGrids as the array size gets smaller.

using DynamicGrids, 
      LinearAlgebra, 
      StaticArrays, 
      BenchmarkTools, 
      DSP

s = 3; x = 1000
A = rand(Float32, x, x)
kernel = rand(Float32, s, s)
sa_kernel = SMatrix{s,s}(kernel)
o = ResultOutput(A; tspan=1:101) # 100 frames, the first tstep is init
rule = Convolution(DynamicGrids.Kernel(sa_kernel))

# DynamicGrids with StaticArray kernel
@btime sim!($o, $rule)
#  457.183 ms (1595 allocations: 7.81 MiB)

# DSP with regular Array kernel
@btime for i = 1:100 conv($A, $kernel); end
#  1.488 s (10700 allocations: 383.84 MiB)

# DSP with StaticArray kernel
@btime for i = 1:100 conv($A, $sa_kernel); end
#  1.632 s (11100 allocations: 383.86 MiB)

# Running the whole simulation setup for every frame - which 
# allocates a lot and isn't type stable (it's not designed to run like 
# this), DynamicGrids.jl is still 1.5x faster:
o = ResultOutput(A; tspan=1:2)
@btime for i in 1:100 sim!($o, $rule) end
#  1.080 s (11000 allocations: 768.13 MiB)

# Single step with preallocated padded arrays - so less allocation,
# but some simulation setup time added (simdata here is pre-built
# simulation data; its construction was omitted from the original comment):
@btime for i in 1:100 sim!($o, $rule; simdata=$simdata) end
#  826.144 ms (2300 allocations: 142.19 KiB)

You can probably optimise this further if you know you will only be using it for dot multiplication.

galenlynch commented 3 years ago

Yeah I have a PR that makes conv in DSP.jl much faster for many use cases by adding direct convolution and also using multithreading. It's a bit stalled at the moment due to my own lack of time and my uncertainty over the design of the system that selects which convolution algorithm to use, since the number of algorithms is something like 10.