We have a time-domain implementation as part of `filt`, as well as a heuristic for when to do time-domain or FFT. Seems like we should unify `filt` and `conv`, and then combine the heuristics to pick from the 3 methods (time-domain, full FFT, and overlap-add FFT).
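For concreteness, such a chooser might look roughly like the sketch below. This is purely hypothetical: the name `_choose_conv_method` and every threshold in it are invented for illustration and are not part of DSP.jl.

# Hypothetical size-based method chooser; the thresholds are placeholders,
# not tuned values, and would need benchmarking to set properly.
function _choose_conv_method(su::Tuple, sv::Tuple)
    small, big = minmax(prod(su), prod(sv))
    if small * big < 2^16
        return :direct        # tiny problems: O(N*M) time-domain wins
    elseif small < 128 && big > 2^14
        return :overlap_save  # short kernel, long signal: block-wise FFT
    else
        return :fft_full      # comparable sizes: one big FFT
    end
end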
Yeah I 100% agree, my main motivation for adding overlap-save to `conv` was to pave the way for combining `filt` and `conv`, at least for FIR filtering. However, the devil is in the details: extrapolating the 1D direct convolution in `filt` to ND with abstract arrays seems like it would be just as fiddly as overlap-save was.
I'd really love to have a `conv` fallback that is as generic as possible, usable with any type combination that supports `*` and `+`, i.e. letting `collect` figure out the result element type. I had made a start somewhere, but hadn't really pursued it. Maybe I can give it another shot. But we might still want an implementation for a limited set of types that is potentially faster.
So here is a potential fallback implementation:
julia> function directconv(A::AbstractArray{<:Any,M}, B::AbstractArray{<:Any,N}) where {M,N}
           # pad the axes with singletons so both arrays look max(M,N)-dimensional
           axes_A = ntuple(i -> axes(A, i), Val(max(M, N)))
           axes_B = ntuple(i -> axes(B, i), Val(max(M, N)))
           # valid kernel indices for output index i: k ∈ axes_B and i - k ∈ axes_A
           krange(i) = CartesianIndices((:).(max.(first.(axes_B), Tuple(i) .- last.(axes_A)),
                                             min.(last.(axes_B), Tuple(i) .- first.(axes_A))))
           return [sum(A[i-k] * B[k] for k in krange(i))
                   for i in CartesianIndices((:).(first.(axes_A) .+ first.(axes_B),
                                                  last.(axes_A) .+ last.(axes_B)))]
       end
directconv (generic function with 1 method)
julia> A = rand(3,4); B = rand(5,6,7);
julia> @btime conv($A,$B);
112.307 μs (178 allocations: 72.63 KiB)
julia> @btime directconv($A,$B);
20.673 μs (885 allocations: 52.00 KiB)
From a brief glance, it's type-stable and the results agree with `conv`. The huge number of allocations is due to the generator inside `sum`, which needs to capture `A` and `B` and hence has to be heap-allocated.
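If that overhead matters, one workaround would be to hoist the inner reduction into a small function with an explicit loop, so nothing needs to be captured in a closure. An untested sketch (the helper `_direct_sum` is not part of the code above, just an illustration):

# Hypothetical helper: same reduction as the sum(...) generator in directconv,
# but as an explicit loop, so no heap-allocated closure over A and B is needed.
function _direct_sum(A, B, i, kr)
    acc = zero(promote_type(eltype(A), eltype(B)))
    for k in kr
        acc += A[i-k] * B[k]
    end
    return acc
end

The comprehension would then call `_direct_sum(A, B, i, krange(i))` in place of the generator.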
I figured out a fast way to do direct convolution that uses SIMD. It doesn't support offset (non-1-based) indexing at the moment, or arrays with different numbers of dimensions, but it's fast.
julia> function _conv_kern_direct!(out, u, v)
           fill!(out, 0)
           u_region = CartesianIndices(u)
           v_region = CartesianIndices(v)
           one_index = oneunit(first(u_region))
           # accumulate a copy of u, scaled by each element of v, into the
           # correspondingly shifted region of out; the inner loop is
           # contiguous in memory, so @simd can vectorize it
           for vindex in v_region
               @simd for uindex in u_region
                   @inbounds out[uindex + vindex - one_index] += u[uindex] * v[vindex]
               end
           end
           out
       end

       function _conv_kern_direct(
               u::AbstractArray{T, N}, v::AbstractArray{S, N}, su, sv) where {T, S, N}
           # the full convolution has size su .+ sv .- 1; allocate it, then fill in-place
           sout = su .+ sv .- 1
           out = similar(u, promote_type(T, S), sout)
           _conv_kern_direct!(out, u, v)
       end
julia> sa = (2,3,4); sb = (5,6,7); a = rand(sa...); b = rand(sb...);
julia> @benchmark conv($b, $a)
BenchmarkTools.Trial:
  memory estimate:  27.78 KiB
  allocs estimate:  164
  --------------
  minimum time:     86.637 μs (0.00% GC)
  median time:      99.368 μs (0.00% GC)
  mean time:        143.282 μs (7.74% GC)
  maximum time:     67.902 ms (95.95% GC)
  --------------
  samples:          10000
  evals/sample:     1
julia> @benchmark directconv($b, $a)
BenchmarkTools.Trial:
  memory estimate:  56.52 KiB
  allocs estimate:  963
  --------------
  minimum time:     25.700 μs (0.00% GC)
  median time:      27.732 μs (0.00% GC)
  mean time:        39.263 μs (22.48% GC)
  maximum time:     5.374 ms (99.37% GC)
  --------------
  samples:          10000
  evals/sample:     1
julia> @benchmark _conv_kern_direct($b, $a, $sb, $sa)
BenchmarkTools.Trial:
  memory estimate:  3.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     5.357 μs (0.00% GC)
  median time:      5.909 μs (0.00% GC)
  mean time:        8.284 μs (18.72% GC)
  maximum time:     11.008 ms (99.89% GC)
  --------------
  samples:          10000
  evals/sample:     6
It's also competitive with `conv` for larger inputs:
julia> sa = (1000, 1000, 3); sb = (3, 3, 3); a = rand(sa...); b = rand(sb...);
julia> @btime conv($a, $b);
135.281 ms (165 allocations: 38.32 MiB)
julia> @btime directconv($a, $b);
498.466 ms (10040044 allocations: 574.50 MiB)
julia> @btime _conv_kern_direct($a, $b, $sa, $sb);
146.988 ms (2 allocations: 38.30 MiB)
This is also with the overlap-save version of `conv`, meaning the same convolution with the FFT-only version of `conv` would be slower:
julia> using DSP: _conv_kern_fft!, nextfastfft
julia> sout = sa .+ sb .- 1; out = zeros(sout); nffts = nextfastfft(sout);
julia> @btime _conv_kern_fft!($out, $a, $b, $sa, $sb, $sout, $nffts);
348.628 ms (172 allocations: 232.88 MiB)
Because it's using SIMD, you get a ~2x speedup for single-precision floats, and a ~4x speedup for narrower element types like Int16.
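For example, to check those speedups with the same shapes as above (a usage sketch; timings deliberately not shown here):

# Assumes sa, sb, _conv_kern_direct, and BenchmarkTools from the snippets above.
a32 = rand(Float32, sa...); b32 = rand(Float32, sb...);
@btime _conv_kern_direct($a32, $b32, $sa, $sb);  # should show the ~2x Float32 speedup

a16 = rand(Int16(-8):Int16(8), sa...); b16 = rand(Int16(-8):Int16(8), sb...);
@btime _conv_kern_direct($a16, $b16, $sa, $sb);  # should show the ~4x Int16 speedup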
Also with the same test data:
julia> using ImageFiltering
julia> @btime imfilter($a, ($b,));
171.826 ms (31 allocations: 68.96 MiB)
It would be great if we had a direct convolution kernel, which would probably be faster for small convolutions.