JuliaDSP / DSP.jl

Filter design, periodograms, window functions, and other digital signal processing functionality
https://docs.juliadsp.org/dev/
Other
374 stars 108 forks source link

Implement time-domain convolution and use it for integers #545

Open martinholters opened 4 months ago

martinholters commented 4 months ago

This does the very minimal thing to get discussion started. I think we should do the time-domain convolution in more cases, but before going in that direction, I'd first like to get feedback.

In particular, for "small" u and v, time-domain convolution would be more efficient and I see no reason not to do it. It just takes a bit of benchmarking to figure out what "small" should be exactly, here. And it would also fix this minor issue:

julia> conv(zeros(0), zeros(0))
ERROR: ArgumentError: invalid Array dimensions

julia> conv(zeros(0), zeros(1))
ERROR: InexactError: trunc(Int64, -Inf)

julia> conv(zeros(1), zeros(0))
ERROR: InexactError: trunc(Int64, -Inf)

The time-domain convolution form this PR does the right thing:

julia> conv(zeros(Int, 0), zeros(Int, 0))
Int64[]

julia> conv(zeros(Int, 0), zeros(Int, 1))
Int64[]

julia> conv(zeros(Int, 1), zeros(Int, 0))
Int64[]

Another point worth considering is whether to use time-domain convolution for more input types. Consider

julia> conv(zeros(BigFloat, 1), zeros(BigFloat, 1))
ERROR: type BigFloat not supported

julia> conv(zeros(Rational{Int}, 1), zeros(Rational{Int}, 1))
1-element Vector{Float64}:
 0.0

For BigFloat, this would change from throwing an error to working, but potentially slow, which I'd consider an improvement. For Rational, I wonder whether doing the computation on Rationals (and also returning Rationals) wouldn't be better. The extreme measure would be to do time-domain convolution by default and only use FFT-based approaches for Float32 and Float64 input (above a certain problem size), which would still cover most cases in practice, I guess. Opinions?

Note that if we decide to do time-domain convolution in more cases, we can always do that in later PRs; this one could stand on its own.

Closes #411, fixes #410.

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 97.79%. Comparing base (76001f4) to head (b127ca2).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #545 +/- ## ========================================== + Coverage 97.74% 97.79% +0.04% ========================================== Files 18 18 Lines 3199 3220 +21 ========================================== + Hits 3127 3149 +22 + Misses 72 71 -1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

wheeheee commented 4 months ago

Should tdconv be exported just like tdfilt? Then users could more easily choose the cutoff themselves.

martinholters commented 4 months ago

maybe Val(N) would help with type stability

It does for N>10, which may not be too relevant, but then again ... why not?

martinholters commented 4 months ago

Should tdconv be exported just like tdfilt? Then users could more easily choose the cutoff themselves.

Hm. Export tdconv and fdconv (names subject to bike-shedding) and have conv do some best-effort choosing among them? Or add kwarg to conv to force a choice instead of the additional exports?

wheeheee commented 4 months ago

I kind of like the names tdconv and fdconv, but maybe adding a kwarg is better, since people might also want to choose the overlap-save algorithm and that's one more function.

martinholters commented 4 months ago

I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.

The algorithm set to :auto means "select based on problem size", which is the default for types known to be reasonably FFT-able (Float32, Float64, and their Complex versions). For other types the default is :direct. This lets user opt into :auto for other types at their own risk, but provides a sane default.

wheeheee commented 4 months ago

I took reference to FastConv.jl (licensed under the "MIT License (Expat)") and wrote a version with an explicit loop over CartesianIndices for 1-based arrays (unsure about offset indices). Here it is:

function _tdconv1!(out, E::AbstractArray{<:Number,N}, k::AbstractArray{<:Number,N}) where {N}
    output_indices = CartesianIndices(ntuple(i -> 1:(size(E, i)+size(k, i)-1), Val(N)))
    checkbounds(out, output_indices)
    fill!(out, 0)
    offset = CartesianIndex(ntuple(_ -> 1, Val(N)))
    for j in CartesianIndices(E), i in CartesianIndices(k)
        out[i+j-offset] = muladd(E[j], k[i], out[i+j-offset])
    end
    return out
end

Benchmarks:

julia> x = rand(100); y = rand(100); out = rand(199);

julia> @benchmark _conv_td!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  9.000 μs …  38.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.216 μs ± 602.871 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      █    █                                            ▃     ▂
  ▅▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁█▁▁▁█ █
  9 μs         Histogram: log(frequency) by time      10.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _tdconv1!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.510 μs …   5.720 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.520 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.583 μs ± 101.580 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █                                                   ▂
  ▄▁▁█▁▁▃▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▄▁▁█▁▁▂ ▂
  1.51 μs         Histogram: frequency by time        1.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
martinholters commented 4 months ago

Yep. I used the generator-based version because it allowed to derive the output type based on the actual values. With the latest change from _conv_td to _conv_td!, a simple loop like you have provided is probably better. Once we're sure about the API, I'll try to optimize, probably by switching to such a loop.

wheeheee commented 4 months ago

I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.

If you're referring to the somewhat messy organization in conv, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.

martinholters commented 4 months ago

If you're referring to the somewhat messy organization in conv, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.

I was primarily referring to the keyword-based API. Once that's settled, the organization of the conv methods can be streamlined. And the heuristic when to choose time vs frequency domain is mostly a placeholder for now, that needs adjustment either way, but that's somewhat orthogonal.

So let's discuss API. Some random thoughts I have gathered while working on this:

I think the current kwargs approach addresses the first three items, although choosing between :auto and :direct based on the eltype might come a bit surprising. I'm not sure I like that. The alternative would be two different auto-like keywords, I guess.

wheeheee commented 4 months ago

My opinion, for what it's worth:

martinholters commented 4 months ago

I'm no fan of @warnings, unless there is also an option to say "I know what I'm doing, no need to flood me with warnings". And including that option would make the API somewhat clunky, in my opinion.

Otherwise, this is slowly getting shape -- partially in code, partially in my head for now. One thing I haven't made up mind about is how conv! should deal with offset axes. A short recap for conv:

The two cases are inconsistent with each other for the input-indices-start-at-1 case, but each one individually is very much the only reasonable choice, so that's where we are. Now if the output provided to conv! matches what conv returns, all is well, but what if not. If the output has a OneTo axis, just storing the results starting at 1 seems ok, I'd say. If the output and inputs have offsets, but they don't "match", if all required indices are present in the output, storing the result there and zeroing the rest sounds reasonable. If the result does not fit into the output, a BoundsError seems warranted. A silent truncation might also be considered. But what if all inputs are OneTos but the output is not? Put the results starting at 2, starting at 1, or error?

Of course, erroring for unclear cases would open the possibility to decide later on, so that might be the safest route for now.

wheeheee commented 4 months ago

Sounds good. Another thought I had regarding commutativity and axes, while reading this after prematurely micro-optimizing the loop: even if we consider * potentially non-commutative, I suspect we could still retain the performance benefits of swapping u and v by simply changing the iteration order of the CartesianIndices.

However, as it is, there is the potential for type instability if the axes of each CartesianIndices are not of the same type, although union splitting may automatically take care of that. If this is implemented, I think extracting the kernel out and using a function barrier would at least look nicer, if it isn't strictly necessary. Scratch that, I see that they are all UnitRanges now...

we should promote the elements from u and v first before multiplying and adding.

I also just realized that muladd automatically promotes first.

galenlynch commented 4 months ago

These all seem like great changes!

martinholters commented 4 months ago

This turns out to be a but more of a rabbit hole than I had anticipated, but it's slowly getting into shape. Apart from some more tests and documentation, the main to-do appears to be a better heuristic to choose the fastest algorithm. But the working of the algorithm kwarg also may need some more consideration. Current status:

The default for algorithm is chosen based on the (output) eltype: :auto if it is Float32, Float64 or a Complex of those, :direct otherwise. Another option would be to have another keyword, say :fastest, to do what :auto does now, and have :auto be what the default does now (i.e. being equal to :direct or :fastest depending on eltype). Would the latter be better?

Another alternative would be to replace the algorithm keyword with something like allow_fft_based::Boolean where false corresponds to current :direct and true to current :auto, with the default again decided by eltype. Would that be preferable?

galenlynch commented 4 months ago

I think allowing the user to choose the algorithm would be better. In my previous attempts to add direct convolution to this and figure out which alg to choose, I found it very hard to come up with heuristics since dimensionality, size of the problem, element type, number of threads available, and architecture (eg avx instructions available) all come into play. Doing a best effort selection and letting users who really care override seems better imo.

martinholters commented 4 months ago

That would then include the option the choose between simple FFT and overlap-save, right? (I had this in an earlier iteration, can revive.)

mbaz commented 4 months ago

I agree with giving users the ability to choose specific algorithms. Maybe we could have fft for auto-selection, and fft_simple, fft_overlapsave to specify an algorithm.

wheeheee commented 4 months ago

I think the second option with :auto replacing the current default behaviour seems safer. Perhaps a simplified general-user-facing conv(u, v; mode=:auto(/:unsafe_fast)) would be friendlier, with :unsafe_fast enabling conversion to floats and FFT for numbers that aren't IEEE floats / complex. And, for those who need it, put the additional, more involved options in conv!(out, u, v; alg), also defining an allocate_output for convenience, where one can choose a specific algorithm, :fft, or the options in conv. The different keyword arguments would be confusing though.

martinholters commented 4 months ago

Maybe we could have fft for auto-selection, and fft_simple, fft_overlapsave to specify an algorithm.

Do you mean fft behave like the present auto, i.e. also allowing time-domain? That would be a bit surprising, so probably not. So I assume fft should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.

@wheeheee I don't think allowing different kwargs in conv and conv! for similar purposes would help simplify the interface for users in any way.

mbaz commented 4 months ago

So I assume fft should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.

Yes, that's what I meant.

martinholters commented 4 months ago

Ok, as there were no complaints about the current API, I've added docs and more tests. If someone could check the docs for comprehensibility, that would be welcome. Remaining to-do is to devise a heuristic for deciding between time and frequency domain.

martinholters commented 3 months ago

Sorry this has stalled. The only thing left to do from my perspective is to come up with a heuristic to choose between time-domain and frequency-domain. I meant to base this on some benchmarking, but that "some" turned out to be quite a lot, especially when not restricting to 1-d. For the time-domain algorithm, my benchmarking confirmed that C*length(u)*length(v) is a reasonable estimate for the runtime with C depending on dimensionality and element type. (Interestingly, dimensionality and element type seem to be non-orthogonal here, though.) But some preliminary benchmarking I did for the FFT convolution gave surprisingly inconclusive results.

If anyone wants to pick this up or has a good idea for a benchmarking strategy - any help is appreciated!

wheeheee commented 3 months ago

Not sure what the detailed benchmarks say, but would it be acceptable to start with the conservative heuristic here since it's IIUC a definite improvement for small sizes? Also saw this and tried it out, with muladd too; I think it fixes the problem with OffsetArrays not SIMDing as well as normal arrays. Hopefully @simd is safe for arbitrary eltypes. Another small problem: the padding heuristic using nextfastfft isn't always optimal, causing performance to have weird jumps sometimes.

julia> x = rand(ComplexF64, 245); rx = reverse(x);

julia> @btime conv($x, $rx; algorithm=:fft_simple);
  45.600 μs (13 allocations: 31.91 KiB)

julia> x = rand(ComplexF64, 246); rx = reverse(x);

julia> @btime conv($x, $rx; algorithm=:fft_simple);
  27.500 μs (13 allocations: 32.28 KiB)