martinholters opened this pull request 4 months ago.

Codecov report: All modified and coverable lines are covered by tests :white_check_mark:. Project coverage is 97.79%. Comparing base (76001f4) to head (b127ca2).
Should `tdconv` be exported just like `tdfilt`? Then users could more easily choose the cutoff themselves.

Maybe `Val(N)` would help with type stability.

It does for `N > 10`, which may not be too relevant, but then again ... why not?
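A small illustration of why `Val(N)` only matters beyond ten elements: `ntuple(f, n::Int)` is special-cased in Base for small `n`, while `ntuple(f, Val(N))` is unrolled at compile time for any `N`, so the result type stays concretely inferred. (This is a side note to the comment above, not code from the PR.)

```julia
# For n ≤ 10, ntuple(f, n::Int) hits hand-written branches in Base and
# infers fine; for larger n, the Int method is not statically sized.
t1 = ntuple(identity, 3)        # (1, 2, 3)

# With Val, the length is a compile-time constant for any N, so the
# result is inferred as a concrete NTuple{12, Int}.
t2 = ntuple(identity, Val(12))
```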
> Should `tdconv` be exported just like `tdfilt`? Then users could more easily choose the cutoff themselves.
Hm. Export `tdconv` and `fdconv` (names subject to bike-shedding) and have `conv` do some best-effort choosing among them? Or add a kwarg to `conv` to force a choice instead of the additional exports?

I kind of like the names `tdconv` and `fdconv`, but maybe adding a kwarg is better, since people might also want to choose the overlap-save algorithm, and that's one more function.
I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.
The `algorithm` keyword set to `:auto` means "select based on problem size", which is the default for types known to be reasonably FFT-able (`Float32`, `Float64`, and their `Complex` versions). For other types the default is `:direct`. This lets users opt into `:auto` for other types at their own risk, but provides a sane default.
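To make the proposed behaviour concrete, a hypothetical usage sketch (keyword values as discussed in this thread; the final API may differ):

```julia
using DSP

u = rand(10_000); v = rand(100)

# Float64 inputs: default is :auto, i.e. pick by problem size.
y = conv(u, v)

# Force the time-domain algorithm explicitly.
y_direct = conv(u, v; algorithm=:direct)

# Integer inputs: default would be :direct; opting into :auto
# (FFT-based, with rounding) is at the caller's own risk.
yi = conv(rand(1:10, 50), rand(1:10, 50); algorithm=:auto)
```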
I took reference from FastConv.jl (licensed under the "MIT License (Expat)") and wrote a version with an explicit loop over `CartesianIndices` for 1-based arrays (unsure about offset indices). Here it is:
```julia
function _tdconv1!(out, E::AbstractArray{<:Number,N}, k::AbstractArray{<:Number,N}) where {N}
    output_indices = CartesianIndices(ntuple(i -> 1:(size(E, i) + size(k, i) - 1), Val(N)))
    checkbounds(out, output_indices)
    fill!(out, 0)
    offset = CartesianIndex(ntuple(_ -> 1, Val(N)))
    for j in CartesianIndices(E), i in CartesianIndices(k)
        out[i+j-offset] = muladd(E[j], k[i], out[i+j-offset])
    end
    return out
end
```
Benchmarks:

```julia
julia> x = rand(100); y = rand(100); out = rand(199);

julia> @benchmark _conv_td!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  9.000 μs … 38.000 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.216 μs ± 602.871 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _tdconv1!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.510 μs … 5.720 μs    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.520 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.583 μs ± 101.580 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
```
Yep. I used the generator-based version because it allowed deriving the output type based on the actual values. With the latest change from `_conv_td` to `_conv_td!`, a simple loop like you have provided is probably better. Once we're sure about the API, I'll try to optimize, probably by switching to such a loop.
> I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.

If you're referring to the somewhat messy organization in `conv`, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.
> If you're referring to the somewhat messy organization in `conv`, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.
I was primarily referring to the keyword-based API. Once that's settled, the organization of the `conv` methods can be streamlined. And the heuristic for when to choose time vs. frequency domain is mostly a placeholder for now; it needs adjustment either way, but that's somewhat orthogonal.

So let's discuss API. Some random thoughts I have gathered while working on this:
- `conv(u, v)` should do something reasonable by default. To me, that means at least doing small convolutions for non-FFT-able datatypes in the time domain and large convolutions of FFT-able datatypes in the frequency domain. For large integer convolutions, it is not clear whether the long runtime of the time domain or the rounding issues of the frequency domain are the lesser evil, so an automatic mode that only considers problem size but ignores input types might also be desirable.
- Whatever algorithm is selected by `conv`, the result type should be independent of the algorithm, and I guess the only thing that really makes sense is `eltype(conv(u, v)) == promote_type(eltype(u), eltype(v))`.
- If we structure `conv` as first allocating the output and then calling a `conv!` function, we should consider making the latter public.
- The generator-based version uses the `collect` machinery to derive the output eltype based on the actual values. That may have applications in rare situations, but I wouldn't want to make the API more complicated just to allow this.
- We need to think about the roles of `u` and `v` in the light of non-commutative number types.

I think the current kwargs approach addresses the first three items, although choosing between `:auto` and `:direct` based on the eltype might come as a bit of a surprise. I'm not sure I like that. The alternative would be two different auto-like keywords, I guess.
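The proposed result-type rule from the list above can be spelled out with plain `promote_type` (not code from the PR, just an illustration of the rule):

```julia
# eltype(conv(u, v)) == promote_type(eltype(u), eltype(v)),
# independent of which algorithm computed the result:
promote_type(Int, Float64)         # Float64
promote_type(Rational{Int}, Int)   # Rational{Int64}
promote_type(Float32, ComplexF32)  # ComplexF32
```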
My opinion, for what it's worth:

- A public `conv!` is welcome; totally in favour of that.
- A `@warn` for non-explicitly-allowed types could be warranted, since that can really slow things down, though I'm not sure if `@warn` does bad things even on the slow path...
- `:direct` by default, so path 2 in #410, with another warning if the inputs aren't reasonably small. This would obviate the need for another keyword argument, if I understand you correctly, and another benefit, as was mentioned in that issue, is conforming with SciPy's behaviour.
- If `conv!` is in use, and the output array is of a wider type (although again, not too sure how to determine that), we should promote the elements from `u` and `v` first before multiplying and adding.

I'm no fan of `@warn`ings, unless there is also an option to say "I know what I'm doing, no need to flood me with warnings". And including that option would make the API somewhat clunky, in my opinion.
Otherwise, this is slowly getting into shape -- partially in code, partially in my head for now. One thing I haven't made up my mind about is how `conv!` should deal with offset axes. A short recap for `conv`:

- If all input axes are `OneTo`s, so will be the output axes, and if the input indices are $1,\ldots,M$ and $1,\ldots,N$, the output indices will be $1,\ldots,M+N-1$ as a `OneTo` (for every axis).
- Otherwise, the output will not have a `OneTo` axis, and if the input indices are $0,\ldots,M-1$ and $0,\ldots,N-1$, the output indices will be $0,\ldots,M+N-2$ (for every axis). Shifts in the input are reflected in the output, so if the input indices are $1,\ldots,M$ and $1,\ldots,N$, the output indices will be $2,\ldots,M+N$. This directly matches the $y[n]=\sum_m v[m]\cdot u[n-m]$ definition.

The two cases are inconsistent with each other for the input-indices-start-at-1 case, but each one individually is very much the only reasonable choice, so that's where we are. Now if the output provided to `conv!` matches what `conv` returns, all is well -- but what if it doesn't? If the output has a `OneTo` axis, just storing the results starting at 1 seems OK, I'd say. If the output and inputs have offsets, but they don't "match": if all required indices are present in the output, storing the result there and zeroing the rest sounds reasonable. If the result does not fit into the output, a `BoundsError` seems warranted. A silent truncation might also be considered. But what if all inputs are `OneTo`s but the output is not? Put the results starting at 2, starting at 1, or error?

Of course, erroring for unclear cases would leave open the possibility to decide later, so that might be the safest route for now.
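The offset-axes convention described in the recap can be sketched with a tiny helper (hypothetical, for illustration only, not part of the PR): the output range of each axis is the elementwise sum of the input ranges, as dictated by $y[n]=\sum_m v[m]\cdot u[n-m]$.

```julia
# Per-axis output range under the offset-aware convention:
# first indices add, last indices add.
output_axis(au::AbstractUnitRange, av::AbstractUnitRange) =
    (first(au) + first(av)):(last(au) + last(av))

output_axis(0:4, 0:2)  # 0:6  -- inputs 0..M-1 and 0..N-1 give 0..M+N-2
output_axis(1:5, 1:3)  # 2:8  -- 1-based inputs shift the output to 2..M+N
```

The second call also shows the inconsistency noted above: `OneTo` inputs produce `1:M+N-1`, while the same index ranges as offset axes produce `2:M+N`.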
Sounds good. Another thought I had regarding commutativity and axes, while reading this after prematurely micro-optimizing the loop: even if we consider `*` potentially non-commutative, I suspect we could still retain the performance benefits of swapping `u` and `v` by simply changing the iteration order of the `CartesianIndices`.

However, as it is, there is the potential for type instability if the axes of each `CartesianIndices` are not of the same type, although union splitting may automatically take care of that. If this is implemented, I think extracting the kernel out and using a function barrier would at least look nicer, if it isn't strictly necessary. Scratch that, I see that they are all `UnitRange`s now...
> we should promote the elements from `u` and `v` first before multiplying and adding.

I also just realized that `muladd` automatically promotes first.
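For reference, `muladd(x, y, z)` computes `x*y + z` after promoting its arguments to a common type, so mixed eltypes in the inner loop are handled without an explicit conversion:

```julia
# Int * Float64 + Rational promotes everything to Float64:
muladd(2, 0.5, 1//2)  # 1.5

# UInt8 * Int + Int promotes to Int:
muladd(0x03, 2, 1)    # 7
```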
These all seem like great changes!
This turns out to be a bit more of a rabbit hole than I had anticipated, but it's slowly getting into shape. Apart from some more tests and documentation, the main to-do appears to be a better heuristic to choose the fastest algorithm. But the workings of the `algorithm` kwarg may also need some more consideration. Current status:

- `:direct` chooses time-domain direct convolution.
- `:fft` chooses an FFT-based algorithm; whether simple or overlap-save is chosen automatically. Should we allow the caller to choose?
- `:auto` picks the faster of those (well, presently modulo the shortcomings of the super-simple heuristic).

The default for `algorithm` is chosen based on the (output) eltype: `:auto` if it is `Float32`, `Float64`, or a `Complex` of those, `:direct` otherwise. Another option would be to have another keyword, say `:fastest`, to do what `:auto` does now, and have `:auto` be what the default does now (i.e. being equal to `:direct` or `:fastest` depending on eltype). Would the latter be better?

Another alternative would be to replace the `algorithm` keyword with something like `allow_fft_based::Bool`, where `false` corresponds to the current `:direct` and `true` to the current `:auto`, with the default again decided by eltype. Would that be preferable?
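The eltype-based default described above amounts to something like the following sketch (function and constant names invented here for illustration; the PR's internals may differ):

```julia
# Types for which FFT-based convolution is known to be fast and safe.
const FFTFriendly = Union{Float32, Float64, ComplexF32, ComplexF64}

# Default algorithm selection: size-based :auto for FFT-friendly
# eltypes, conservative :direct for everything else.
default_algorithm(::Type{T}) where {T} = T <: FFTFriendly ? :auto : :direct

default_algorithm(Float64)     # :auto
default_algorithm(ComplexF32)  # :auto
default_algorithm(Int)         # :direct
default_algorithm(BigFloat)    # :direct
```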
I think allowing the user to choose the algorithm would be better. In my previous attempts to add direct convolution to this and figure out which algorithm to choose, I found it very hard to come up with heuristics, since dimensionality, size of the problem, element type, number of threads available, and architecture (e.g. AVX instructions available) all come into play. Doing a best-effort selection and letting users who really care override seems better, imo.
That would then include the option to choose between simple FFT and overlap-save, right? (I had this in an earlier iteration; I can revive it.)
I agree with giving users the ability to choose specific algorithms. Maybe we could have `:fft` for auto-selection, and `:fft_simple`, `:fft_overlapsave` to specify an algorithm.
I think the second option, with `:auto` replacing the current default behaviour, seems safer.
Perhaps a simplified general-user-facing `conv(u, v; mode=:auto)` (or `mode=:unsafe_fast`) would be friendlier, with `:unsafe_fast` enabling conversion to floats and FFT for numbers that aren't IEEE floats / complex.

And, for those who need it, put the additional, more involved options in `conv!(out, u, v; alg)`, also defining an `allocate_output` for convenience, where one can choose a specific algorithm, `:fft`, or the options in `conv`. The different keyword arguments would be confusing, though.
> Maybe we could have `:fft` for auto-selection, and `:fft_simple`, `:fft_overlapsave` to specify an algorithm.

Do you mean `:fft` should behave like the present `:auto`, i.e. also allowing time-domain? That would be a bit surprising, so probably not. So I assume `:fft` should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.
@wheeheee I don't think allowing different kwargs in `conv` and `conv!` for similar purposes would help simplify the interface for users in any way.
> So I assume `:fft` should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.

Yes, that's what I meant.
Ok, as there were no complaints about the current API, I've added docs and more tests. If someone could check the docs for comprehensibility, that would be welcome. Remaining to-do is to devise a heuristic for deciding between time and frequency domain.
Sorry this has stalled. The only thing left to do, from my perspective, is to come up with a heuristic to choose between time domain and frequency domain. I meant to base this on some benchmarking, but that "some" turned out to be quite a lot, especially when not restricting to 1-d. For the time-domain algorithm, my benchmarking confirmed that `C * length(u) * length(v)` is a reasonable estimate for the runtime, with `C` depending on dimensionality and element type. (Interestingly, dimensionality and element type seem to be non-orthogonal here, though.) But some preliminary benchmarking I did for the FFT convolution gave surprisingly inconclusive results.
If anyone wants to pick this up or has a good idea for a benchmarking strategy - any help is appreciated!
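For concreteness, one plausible shape for such a heuristic (entirely hypothetical; the constants would have to come from the benchmarking discussed above): compare the `C_td * length(u) * length(v)` time-domain estimate against an `n log n` estimate for the simple FFT path.

```julia
# Crossover sketch: direct convolution costs ~ C_td * nu * nv,
# a simple FFT convolution costs ~ C_fft * n * log2(n) with
# n = nu + nv - 1. C_td and C_fft are placeholder constants that
# would depend on dimensionality, eltype, and hardware.
function prefer_direct(nu, nv; C_td=1.0, C_fft=10.0)
    n = nu + nv - 1
    return C_td * nu * nv < C_fft * n * log2(n)
end

prefer_direct(100, 10)         # true:  small kernel, direct wins
prefer_direct(10_000, 10_000)  # false: large problem, FFT wins
```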
Not sure what the detailed benchmarks say, but would it be acceptable to start with the conservative heuristic here, since it's, IIUC, a definite improvement for small sizes?

Also saw this and tried it out, with `muladd` too; I think it fixes the problem with `OffsetArray`s not SIMDing as well as normal arrays. Hopefully `@simd` is safe for arbitrary eltypes.

Another small problem: the padding heuristic using `nextfastfft` isn't always optimal, causing performance to have weird jumps sometimes.
```julia
julia> x = rand(ComplexF64, 245); rx = reverse(x);

julia> @btime conv($x, $rx; algorithm=:fft_simple);
  45.600 μs (13 allocations: 31.91 KiB)

julia> x = rand(ComplexF64, 246); rx = reverse(x);

julia> @btime conv($x, $rx; algorithm=:fft_simple);
  27.500 μs (13 allocations: 32.28 KiB)
```
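The `nextfastfft`-style padding rounds the FFT length up to a product of small primes; a minimal reimplementation sketch (my own, not the DSP.jl code) makes the jump above easier to see:

```julia
# true if n factors entirely into 2, 3, 5, and 7 (an "FFT-friendly" size)
function isfast(n)
    for p in (2, 3, 5, 7)
        while n % p == 0
            n ÷= p
        end
    end
    return n == 1
end

# smallest FFT-friendly size ≥ n
function nextfast(n)
    while !isfast(n)
        n += 1
    end
    return n
end

nextfast(489)  # 490 -- padded length for 245-sample inputs (2*245 - 1)
nextfast(491)  # 500 -- padded length for 246-sample inputs (2*246 - 1)
```

One plausible explanation for the counterintuitive timings: 490 = 2·5·7² contains a 7² factor, which FFTW typically handles more slowly than the pure 2s and 5s of 500 = 2²·5³, so the nominally smaller padded length ends up slower.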
This does the very minimal thing to get discussion started. I think we should do the time-domain convolution in more cases, but before going in that direction, I'd first like to get feedback.

In particular, for "small" `u` and `v`, time-domain convolution would be more efficient, and I see no reason not to do it. It just takes a bit of benchmarking to figure out what "small" should be exactly, here. And it would also fix this minor issue: the time-domain convolution from this PR does the right thing.
Another point worth considering is whether to use time-domain convolution for more input types. For `BigFloat`, this would change from throwing an error to working, but potentially slowly, which I'd consider an improvement. For `Rational`, I wonder whether doing the computation on `Rational`s (and also returning `Rational`s) wouldn't be better. The extreme measure would be to do time-domain convolution by default and only use FFT-based approaches for `Float32` and `Float64` input (above a certain problem size), which would still cover most cases in practice, I guess. Opinions?

Note that if we decide to do time-domain convolution in more cases, we can always do that in later PRs; this one could stand on its own.
Closes #411, fixes #410.