Direct colvolution - Githubissues

galenlynch commented 5 years ago

It would be great if we had a direct convolution kernel, which would probably be faster for small convolutions.

ssfrr commented 5 years ago

We have a time-domain implementation as part of filt, as well as a heuristic for when to do time-domain or FFT. Seems like we should unify filt and conv, and then combine the heuristics to pick from the 3 methods (time-domain, full FFT, and overlap-add FFT).

galenlynch commented 5 years ago

Yeah I 100% agree, my main motivation for adding overlp-save to conv was to pave the way to combine filt and conv, at least for FIR filtering. However, the devil is in the details: extrapolating the 1D direct convolution in filt to ND with abstract arrays seems like it would be just as fiddly as overlap save was.

martinholters commented 5 years ago

I'd really love to have a conv fallback that is a generic as possible, being usable with any type combination that supports * and +, i.e. letting collect figure out the result element type. I had made a start somewhere, but hadn't really pursued it. Maybe I can give it another shot. But we might still want an implementation for a limited set of types that is potentially faster.

martinholters commented 5 years ago

So here is a potential fallback implementation:

julia> function directconv(A::AbstractArray{<:Any,M}, B::AbstractArray{<:Any,N}) where {M,N}
           axes_A = ntuple(i -> axes(A, i), Val(max(M,N)))
           axes_B = ntuple(i -> axes(B, i), Val(max(M,N)))
           krange(i) = CartesianIndices((:).(max.(first.(axes_B), Tuple(i) .- last.(axes_A)), min.(last.(axes_B), Tuple(i) .- first.(axes_A))))
           return [sum(A[i-k]*B[k] for k in krange(i)) for i in CartesianIndices((:).(first.(axes_A).+first.(axes_B), last.(axes_A).+last.(axes_B)))]
       end
directconv (generic function with 1 method)

julia> A = rand(3,4); B = rand(5,6,7);

julia> @btime conv($A,$B);
  112.307 μs (178 allocations: 72.63 KiB)

julia> @btime directconv($A,$B); 
  20.673 μs (885 allocations: 52.00 KiB)

From a brief glance, it's type-stable and results agree with conv. The huge number of allocations are due to the generator inside sum which needs to capture A and B and hence has to be heap allocated.

galenlynch commented 5 years ago

I figured out a fast way to do direct convolution that uses SIMD. It doesn't support weird indexing at the moment, or arrays of different number of dimensions, but it's fast.

julia> function _conv_kern_direct!(out, u, v)
    fill!(out, 0)
    u_region = CartesianIndices(u)
    v_region = CartesianIndices(v)
    one_index = oneunit(first(u_region))
    for vindex in v_region
        @simd for uindex in u_region
            @inbounds out[uindex + vindex - one_index] += u[uindex] * v[vindex]
        end
    end
    out
end

function _conv_kern_direct(
    u::AbstractArray{T, N}, v::AbstractArray{S, N}, su, sv) where {T, S, N}
    sout = su .+ sv .- 1
    out = similar(u, promote_type(T, S), sout)
    _conv_kern_direct!(out, u, v)
end

julia> sa = (2,3,4); sb = (5,6,7); a = rand(sa...); b = rand(sb...);

julia> @benchmark conv($b, $a)
BenchmarkTools.Trial: 
  memory estimate:  27.78 KiB
  allocs estimate:  164
  --------------
  minimum time:     86.637 μs (0.00% GC)
  median time:      99.368 μs (0.00% GC)
  mean time:        143.282 μs (7.74% GC)
  maximum time:     67.902 ms (95.95% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark directconv($b, $a)
BenchmarkTools.Trial: 
  memory estimate:  56.52 KiB
  allocs estimate:  963
  --------------
  minimum time:     25.700 μs (0.00% GC)
  median time:      27.732 μs (0.00% GC)
  mean time:        39.263 μs (22.48% GC)
  maximum time:     5.374 ms (99.37% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark _conv_kern_direct($b, $a, $sb, $sa)
BenchmarkTools.Trial: 
  memory estimate:  3.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     5.357 μs (0.00% GC)
  median time:      5.909 μs (0.00% GC)
  mean time:        8.284 μs (18.72% GC)
  maximum time:     11.008 ms (99.89% GC)
  --------------
  samples:          10000
  evals/sample:     6

It's also competitive with conv for larger inputs:

julia> sa = (1000, 1000, 3); sb = (3, 3, 3); a = rand(sa...); b = rand(sb...);

julia> @btime conv($a, $b);
  135.281 ms (165 allocations: 38.32 MiB)

julia> @btime directconv($a, $b);
  498.466 ms (10040044 allocations: 574.50 MiB)

julia> @btime  _conv_kern_direct($a, $b, $sa, $sb);
  146.988 ms (2 allocations: 38.30 MiB)

This is also with the overlap-save version of conv, meaning the same convolution with the FFT-only version of conv would be slower:

julia> using DSP: _conv_kern_fft!, nextfastfft

julia> sout = sa .+ sb .- 1; out = zeros(sout); out = zeros(sout); nffts = nextfastfft(sout);

julia> @btime _conv_kern_fft!($out, $a, $b, $sa, $sb, $sout, $nffts);
  348.628 ms (172 allocations: 232.88 MiB)

Because it's using SIMD, you get a ~2x speedup for single precision float, and a ~4x speedup for Int16 etc.

Also with the same test data:

julia> using ImageFiltering

julia> @btime imfilter($a, ($b,));
  171.826 ms (31 allocations: 68.96 MiB)

JuliaDSP / DSP.jl

Direct colvolution #292