RoyiAvital opened this issue 4 years ago
See this similar plot (but for longer filters) that @galenlynch made that compares regular FFT processing to overlap-save. I think that the thing to do is to add another check for short convolutions that uses the direct method for those.
@RoyiAvital do you have the code for that plot? I'm curious about the weird diagonal lines where the overlap-save is faster. Did you use `nextfastfft` for the FFT-based plots?
Much of this is a repeat of what I said on Discourse, but I think the tradeoff between direct and FFT or overlap-save convolution is not a question of whether everything is in memory or not, but rather of what the kernel size is. FFT convolution has an algorithmic advantage over direct convolution (`N*log(N)` vs `N^2`), so you should only choose direct convolution when the overhead outweighs the complexity savings. Those savings are only increased with overlap-save compared to FFT convolution. In my testing, the break-even point between FFT convolution and direct convolution happens when the kernel is length 58. You are welcome to compare for yourself, as direct convolution is already implemented in DSP.jl in the `td_filt` function for 1d arrays, which is chosen inside of `filt` if the kernel is less than 59 elements long. It is already vectorized (i.e. benefits from `@simd` and `@inbounds`), and furthermore we have unrolled the inner loop for kernel lengths less than 59.
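If anyone wants to poke at this themselves, a minimal sketch of such a comparison (the sizes here are arbitrary, not my actual benchmark script; note `filt` returns `length(x)` samples while `conv` returns the full `length(x) + length(h) - 1` result, so it's only indicative):

```julia
using DSP, BenchmarkTools

# Compare direct FIR filtering (filt) against conv near the reported
# break-even kernel length of 58.
x = randn(10_000)
for m in (16, 58, 128)
    h = randn(m)
    t_filt = @belapsed filt($h, $x)
    t_conv = @belapsed conv($x, $h)
    println("M = $m: filt $(t_filt) s, conv $(t_conv) s")
end
```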
As I mentioned in Discourse, it might be easy to use the direct-convolution `filt` code for some subsets of `conv`, since `conv` is more general than `filt` at the moment.
Could you please compare your times in Matlab to the DSP.jl times?
I think you misinterpret my writing. I will try to summarize, but I don't think there is a point in continuing the debate, as the image I pasted above shows real-world results.
Whenever the kernel is small enough for the L2 cache of the CPU, the favorable memory access pattern of the direct method dominates. For many applications of convolution this is the most common case: think of differentiation, Savitzky-Golay, etc. Those are all pretty small convolution kernels, not to mention in the 2D world. So having a direct method to handle those is a must. On a modern CPU it will be the fastest up to a kernel length of roughly 400-500 samples. Maybe on the Ryzen 3xxx it will be more, as it has giant caches.
When the kernel is too large (by kernel I mean the kernel plus the same length of data from the signal), the superior memory access pattern no longer dominates and the excessive flops of the direct method kill it. Yet in those cases, what will usually be faster than Overlap and Save is the frequency-domain approach: it is well threaded and uses fewer operations.
Overlap and Save requires both DFTs and a lot of memory access (selection of indices, writing to the array, etc.). Its sweet spot is only when the kernel is too big for the direct method, yet the ratio of sizes between the signal and the kernel is so big that padding the kernel and applying an FFT has such a cost that Overlap and Save wins. If your kernel is huge and the signal is huge, the `N*log(N)` factor of the DFT is something like `K*N*log(K*N)` where `K` is the ratio. When we get into the big-signals world, `K` is bounded; it can't be 1000 or something like that. The cases where it is that large are really edge cases.
I showed @ssfrr with my code that for `M = 240000` and `N = 8640000` the frequency-domain approach is still faster.
I also ran my analysis on bigger signals to get a broader perspective:
I also learned from a talk with @ssfrr that your implementation of frequency-domain convolution uses `nextpow2()` to set the length of the zero padding. This is also a big mistake: pad to the length of the convolution. MATLAB uses FFTW as well, so the results should match.
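A quick way to sanity-check this claim (illustrative sizes; `nextfastfft` in DSP.jl instead pads only to the next fast composite size):

```julia
using FFTW, BenchmarkTools

# Time an FFT at the exact linear-convolution length N + M - 1 versus
# one padded to the next power of two.
N, M = 2^19, 150
L = N + M - 1
t_exact = @belapsed fft($(randn(L)))
t_pow2  = @belapsed fft($(randn(nextpow(2, L))))
println("exact length: $t_exact s, next power of 2: $t_pow2 s")
```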
Again, I think you should look at the results I posted; they speak for themselves. The code is also available and you can use it for further analysis. The analysis was part of my answer: https://dsp.stackexchange.com/a/66947.
Anyhow, just my 2 cents.
I understand and agree with your point that direct convolution will be faster than Fourier methods for kernel sizes between 0 and some number. I don't know where that trade-off point is, and I think it's worth keeping in mind how much run time will be saved. That being said, I'm all for a direct convolution branch in `conv`, and if you have the time and energy to implement it, I strongly encourage you to do so. However, I do take some issue with the particulars of your plots above, and the surprising conclusions that you draw from them.
In those plots, you're comparing the runtime of Matlab's `conv2` function against the runtime of two of your own Matlab functions. By doing so, you're hoping to compare the underlying algorithms without regard to the implementation details. I don't know the ins and outs of how you implemented those functions, but let's check your implementation against DSP.jl's.
To generate your test times, I ran your code on my computer with `convShape = CONVOLUTION_SHAPE_FULL`. To compare these times to the current DSP.jl `conv`, I used the median `conv` run time as calculated with BenchmarkTools.jl, with about 10,000 runs for each kernel size. For the sake of time, I only compared the results of the last row of your results, where the signal length is 2,020. The first thing that jumps out at me is that these measurements look quite noisy: I don't really know how Matlab's `timeit` works, but perhaps it didn't use a sufficient number of runs.
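For reference, a sketch of that measurement procedure (the kernel lengths here are arbitrary, not the exact set I used):

```julia
using DSP, BenchmarkTools, Statistics

# Median conv run time for a 2,020-sample signal over a range of
# kernel lengths.
x = randn(2_020)
for m in (4, 36, 100, 292, 1_000, 2_020)
    h = randn(m)
    trial = @benchmark conv($x, $h) samples=10_000
    println("M = $m: ", median(trial).time / 1e3, " μs")
end
```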
It's worth noting that DSP.jl's `conv` only uses overlap-save for the first 10 kernels (of length 4-292) with a signal of length 2,020; beyond that, its performance heuristics choose DFT convolution. Nonetheless, across this range you can see that DSP.jl's overlap-save algorithm performs similarly to Matlab's `conv2`, and outperforms your overlap-save algorithm many fold. DSP.jl's `conv` is only up to 60 μs slower than Matlab's `conv2`, and outperforms it for larger kernels.
However, as you already pointed out, 2,020 is a small number of datapoints for many signals, though it's maybe a relevant size for images. Two seconds of audio data (sampled at 44.1 kS/s) is around 100,000 samples long, so let's see what happens in this regime. Here I'm showing results generated the same way, but for a signal of length 100,000 and kernels with lengths between 4 and 2,020. You can see that for almost all of these kernels DSP.jl's `conv` outperforms Matlab's `conv2`, though with kernels smaller than about 90 this is not true. At all kernel sizes, DSP.jl vastly outperforms your overlap-save, as well as your DFT method. Here, DSP.jl's `conv` is using overlap-save for all kernel sizes. While Matlab's `conv2` does perform better for kernels less than 90 data points long, at worst DSP.jl takes 550 μs longer.
With all of these tests, we can see that implementation does matter, and that the overlap-save algorithm seems to fare well—in agreement with many sources. Certainly, in some cases a direct convolution algorithm will be faster, though I think that calculating when the increased efficiency of overlap-save will not make up for the extra overhead will be difficult since it depends on both the kernel size and the signal size. I am very hesitant to draw any conclusions about cache-friendliness or thread-friendliness of the underlying algorithms from your Matlab scripts, or indeed from DSP.jl's implementation. I don't know why you think it's better to not use the closest power of 2, 3, or 5 prior to Fourier transformation, which is in contrast to FFTW's recommendations, but if you have some actual benchmarks using Julia code then a PR would be welcome.
Once again, if you or others are motivated to implement a direct convolution method, that would be great. However, I haven't been motivated to chase those extra microseconds, given how hard I think it will be to make a direct convolution method that generalizes to arbitrary dimensions and array indexing.
@RoyiAvital I'm confused by something on your plots - I'd expect your first plot (up to length 2000) to be the top-left corner of your next plot (up to length 18000) but they're showing different results. The first shows that direct is faster up to kernel lengths of about 500, but the second shows direct being faster up to about 2000 - was there something else different about those two runs?
Say you have a length `N` signal and length `M` filter, where `M <= N`:

- Direct convolution is order `N*M`.
- FFT convolution is order `N*log(N)` (because the filter is zero-padded to the length of the signal).
- Overlap-save (OVS) is order `N*log(M)`, because you have order `N/M` blocks, and each block takes order `M*log(M)`.

For a fixed filter length `M`, varying the signal length `N` (columns of @RoyiAvital's diagram above):

- Direct and OVS are both linear in `N`, so it's not clear which should be faster, even asymptotically.
- FFT is order `N*log(N)`, so regardless of the filter size there should be some signal length where FFT is the worst choice.

For a fixed signal length `N`, varying the kernel length `M` (rows of @RoyiAvital's diagram, and the case that @galenlynch plotted above):

- `N` is a constant factor, so the cost of FFT convolution is constant.
- Direct is order `M`, but OVS is order `log(M)`. The rub here is that `M` is bounded to be less than `N`, so asymptotic behavior isn't really relevant.

So I think that all the complexity analysis can tell us is that for a fixed kernel length, there's some signal length where FFT is the worst choice, and everything else depends on the constant factors.
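As a toy illustration of those cost models (all constant factors dropped, so only the shapes of the curves mean anything):

```julia
# Simplified cost models from the list above; real crossover points
# depend entirely on implementation constants and must be measured.
cost_direct(N, M) = N * M
cost_fft(N)       = N * log2(N)
cost_ovs(N, M)    = N * log2(M)

for N in (10^4, 10^6), M in (32, 1024)
    println((; N, M), " => direct: ", cost_direct(N, M),
            ", fft: ", round(Int, cost_fft(N)),
            ", ovs: ", round(Int, cost_ovs(N, M)))
end
```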
As I mentioned in my discourse post, different fields have different common use-cases, so it's probably not productive to think in terms of what sizes are "huge" or "small" or which applications are "rare" or "common", particularly when we already have the hard work of implementing all three methods done.
So if someone wants to take this on: we still haven't really answered the question of where these cutoffs are. Really we'd want to compare `DSP._conv_kern_fft!`, `DSP.unsafe_conv_kern_os!`, and `DSP.filt!` on a variety of sizes (probably on a log scale, as in @galenlynch's plot in the linked PR). Even if the criterion depends on both `N` and `M`, once we have that plot we could probably just define some lines that slice up the space appropriately and use that as a heuristic.
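The resulting heuristic could be as simple as the sketch below, where both thresholds are placeholders to be replaced by values read off the benchmark grid (58 is the break-even length mentioned above; the OVS bound is invented):

```julia
# Hypothetical shape of such a heuristic.
function choose_conv_algorithm(N, M)
    M <= 58     ? :direct :        # small kernels: direct wins
    M <= N ÷ 16 ? :overlap_save :  # kernel much shorter than signal
                  :fft             # otherwise plain FFT convolution
end
```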
@ssfrr, I think all present know the theory of the asymptotic performance of the three methods. In my opinion, as long as you don't know the constants of the asymptotic behavior, it doesn't give you much in practice, as you'd need to run everything in practice anyway, so what's the point? If your analysis covered cache size, number of floating-point ports in the CPU, memory access patterns, and memory bandwidth and latency, then it would add something.
Regarding the cutoff point, I wouldn't do deep research on that.
What I'd do is make sure your frequency-domain function `DSP._conv_kern_fft!()` doesn't use `nextpow2()` when the signal is longer than, I don't know, around 1024 samples. As far as I understood from @ssfrr, it does. Again, while "FFT is optimized for powers of 2" is totally right, it doesn't mean it is faster to apply `fft` on `2^20` samples than on `2^19 + 150` samples. And since, as far as I understand, you do apply `nextpow2()`, your cutoff between the frequency domain and OVS isn't right either.
@galenlynch, regarding your comment about 2 seconds of audio being 100,000 samples: the question is what we want to do with this data. If we want to filter it and the kernel we use is also very long, I'd say that in most cases we should use multirate approaches. It is not good practice to apply an LPF with 1/32 of the bandwidth directly on a signal, for instance, as it means the filter kernel will be long. There are tricks to do it properly which keep the filters smaller and more effective.
Anyhow, I think you have all you need. What's needed is to convert the MATLAB code to Julia and make sure the code in Julia is as efficient as it can be (MATLAB can't get as efficient as Julia):

- Don't use `nextpow2()` in the frequency-domain convolution.
- Use the `symmetric` property in the `ifft()` to make it faster.
- Support `full`, `same` and `valid` in `conv()`.
- Implement direct convolution. `filt()` isn't good here: the convention in the field is that `filter()` returns the same length as the input signal, yet it is different from `same` (while one acts like padding with zeros at the beginning, the other at the end). It should be well optimized with SIMD operations. I don't think multithreading is important here (as it will be used for small kernels); MATLAB does have MT in `conv()`, yet that is mainly because it uses only the direct method for all lengths.

EDIT: I agree with what @ssfrr said about complexity.
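A minimal sketch of the real-transform point from the list above (my illustration, not DSP.jl's code): for real input, real-to-complex transforms avoid the redundant half of the spectrum, and the padding goes to the exact output length rather than a power of two.

```julia
using FFTW

# Frequency-domain linear convolution for real input using rfft/irfft.
function fftconv_real(x::Vector{Float64}, h::Vector{Float64})
    L = length(x) + length(h) - 1      # exact linear-convolution length
    X = rfft([x; zeros(L - length(x))])
    H = rfft([h; zeros(L - length(h))])
    return irfft(X .* H, L)
end
```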
So I think it's a question of judging when the overhead of either overlap-save or FFT convolution is not worth it, and choosing the algorithm appropriately. This is already being done to choose between overlap-save and FFT convolution, and the benchmarks guiding this decision point were from two Julia functions. While we don't have a direct convolution branch yet, my testing with the pretty fast direct convolution that's part of the FIR filtering codebase (which has comparable performance to Matlab's `conv2`) suggests that for large signals with ~1 million samples, direct convolution is faster than overlap-save only with kernels of length less than 59, and that the difference in run times is on the order of tens to hundreds of microseconds. @ssfrr, your suggestion of doing this on many scales makes sense.
I think I wasn't quite clear in my post with the benchmarks of DSP.jl and the Matlab comparisons made by @RoyiAvital: I don't buy the idea that your Matlab benchmarks tell us which algorithm to choose in which situation. First of all, your benchmarks compare a compiled function that is shipped with Matlab against two of your own functions written in the Matlab language, which is unfair. The direct convolution code in your benchmark uses Matlab's `conv2`, which is not written in Matlab but is probably compiled C++. Furthermore, I don't know how you wrote your overlap-save or FFT convolution functions, but from what I can see it looks like you wrote a slow overlap-save function. If your premise is that, based on your Matlab testing, we are using the wrong algorithm and should change what we are doing, then I disagree.
From what I can tell, direct convolution would indeed be faster for small filters, but only in the cases outlined above, and for marginal performance savings. In case you think it's a problem with our implementation: `tdfilt`, which calls this code, is as fast as, or around 1/4 the speed of, Matlab's `conv2`. @RoyiAvital, it sounds like you might be familiar with the inner workings of `conv2`; if it's multithreaded, that would explain the difference.
Multirate stuff is also implemented in DSP.jl... let's just choose the fastest algorithm for `conv` guided by real-world Julia benchmarks. Comparing your own code written in Matlab against Matlab's C++ code to guide what's happening in DSP.jl is really comparing apples and oranges.
We have an implementation of all three algorithms already; I don't know why you think we should copy your code.
- We already use a real `ifft` if the data is Real.
- `full`, `same` and `valid` might be a good (if not totally unrelated) issue, but right now the code returns `full` and lets people clip it if they want to. Let's split that off into a separate issue, and keep in mind that we are not trying to copy Matlab.
- We already have fast direct convolution in `tdfilt`. It just doesn't have quite the right zero padding at the end, which is a trivial thing to fix. It should really just be changed so that both `filt` and `conv` can use the same code. PR welcome.

@galenlynch, I have no idea why you wrote that I suggested you copy my code.
All I wrote is:
- Add `full`, `same` and `valid` functionality.
- Add an `auto` mode which will use whatever heuristics run-time analysis shows. Yet as long as you give the user the option, the significance of your tweaking and decisions isn't critical.

My MATLAB implementation of OVS is non-allocating and uses only FFT (implemented by FFTW) and indexing, two operations MATLAB is quite efficient at (not like Julia, but this is as efficient as MATLAB can be). Hence it gave us the intuition.
My conclusion from my MATLAB implementation, that the gains of OVS are negligible, might be different with a much more efficient OVS implementation in Julia. But I'd still bet that the cases where OVS is superior to both the direct and DFT-based methods are a minority.
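For reference, a bare-bones overlap-save sketch in Julia (allocating per block for clarity, unlike the non-allocating MATLAB version described above; a real implementation would preallocate buffers and use precomputed FFT plans):

```julia
using FFTW

# Overlap-save linear convolution: one FFT of the kernel, then each
# length-B block of the (zero-prepended) signal is filtered in the
# frequency domain and the first M - 1 wrapped samples are discarded.
function ovs_conv(x::Vector{Float64}, h::Vector{Float64}, B::Int=8 * length(h))
    M = length(h)
    L = B - M + 1                     # valid output samples per block
    H = rfft([h; zeros(B - M)])
    xp = [zeros(M - 1); x; zeros(B)]  # prepend M - 1 zeros, pad the tail
    y = zeros(length(x) + M - 1)
    for start in 1:L:length(y)
        blk = xp[start:start+B-1]
        yblk = irfft(rfft(blk) .* H, B)
        n = min(L, length(y) - start + 1)
        y[start:start+n-1] .= yblk[M:M+n-1]
    end
    return y
end
```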
This thread is getting a little unfocused. As I understand it the central claim is that there's a regime where direct convolution is faster than FFT or overlap-save (OVS). I think we can all agree that's true, but someone needs to run the benchmarks to figure out where that threshold is. Improving the FFT and OVS implementations and API changes can be a separate issue (and I agree that there can be API improvements). I'm not sure if there's much more to discuss until we see those benchmarks, using the Julia implementations.
I agree with the central claim, and also that the API could be improved. Sorry if I got a bit aggressive in my response.
To add another data point to this discussion: I'm not across the technical details you are discussing here, but I was surprised that, in practice, the direct windowing in DynamicGrids.jl runs 3*3 `StaticArray` convolutions 1.5-3x faster than DSP.jl, and it wasn't written to do convolutions! It's hard to benchmark precisely because of differences in initial allocations and setup; if DSP.jl had a pre-allocated `conv!` method this would be more reliable. This is for 1000*1000 arrays, but it increasingly favours DynamicGrids as the array size gets smaller.
```julia
using DynamicGrids, LinearAlgebra, StaticArrays, BenchmarkTools, DSP

s = 3; x = 1000
A = rand(Float32, x, x)
kernel = rand(Float32, s, s)
sa_kernel = SMatrix{s,s}(kernel)
o = ResultOutput(A; tspan=1:101) # 100 frames, the first tstep is init
rule = Convolution(DynamicGrids.Kernel(sa_kernel))

# DynamicGrids with StaticArray kernel
@btime sim!($o, $rule)
# 457.183 ms (1595 allocations: 7.81 MiB)

# DSP with regular Array kernel
@btime for i = 1:100 conv($A, $kernel); end
# 1.488 s (10700 allocations: 383.84 MiB)

# DSP with StaticArray kernel
@btime for i = 1:100 conv($A, $sa_kernel); end
# 1.632 s (11100 allocations: 383.86 MiB)

# Running the whole simulation setup for every frame - which
# allocates a lot and isn't type stable (it's not designed to run like
# this), DynamicGrids.jl is still 1.5x faster:
o = ResultOutput(A; tspan=1:2)
@btime for i in 1:100 sim!($o, $rule) end
# 1.080 s (11000 allocations: 768.13 MiB)

# Single step with preallocated padded arrays - so less allocation,
# but some simulation setup time added (`simdata` is assumed to have
# been built beforehand with DynamicGrids' simulation-data setup):
@btime for i in 1:100 sim!($o, $rule; simdata=$simdata) end
# 826.144 ms (2300 allocations: 142.19 KiB)
```
You can probably optimise this further if you know you will only be using it for dot multiplication.
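For intuition, a sketch of what a specialized static 3x3 pass can look like (`conv3x3!` is a hypothetical helper, in cross-correlation form; flip the kernel for true convolution):

```julia
using StaticArrays

# The kernel-sized inner product unrolls completely at compile time
# because the kernel dimensions are in the type.
function conv3x3!(out::Matrix{T}, A::Matrix{T}, k::SMatrix{3,3,T}) where T
    @inbounds for j in 2:size(A, 2)-1, i in 2:size(A, 1)-1
        w = SMatrix{3,3,T}(view(A, i-1:i+1, j-1:j+1))
        out[i, j] = sum(w .* k)
    end
    return out
end
```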
Yeah I have a PR that makes conv in DSP.jl much faster for many use cases by adding direct convolution and also using multithreading. It's a bit stalled at the moment due to my own lack of time and my uncertainty over the design of the system that selects which convolution algorithm to use, since the number of algorithms is something like 10.
The current implementation uses, in many cases, the Overlap and Save method for linear convolution. While this method is optimized for streamed data, it is not optimized for cases where all the data is already in memory.
See https://discourse.julialang.org/t/convolution-conv-with-same-size-output/38260/11:
> I think it is better to implement a direct loop with `@simd` and `@inbounds`. It will probably be faster than the current method for most cases. For longer kernels / signals it should use the Frequency Domain as it is implemented now,
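A minimal sketch of such a direct loop (full output length; an illustration along those lines, not DSP.jl's `td_filt`):

```julia
# Direct "full" convolution with the inner loop vectorized.
function direct_conv(x::Vector{Float64}, h::Vector{Float64})
    y = zeros(length(x) + length(h) - 1)
    @inbounds for j in eachindex(h)
        hj = h[j]
        @simd for i in eachindex(x)
            y[i+j-1] += hj * x[i]
        end
    end
    return y
end
```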