Not a big deal, but I noticed that Julia's CUFFT is about 50% slower than PyTorch when performing an FFT on a real-valued array.
If the array is complex, both have the same speed. rfft also seems to work fine, and it is anyway the recommended choice for a real-valued array.
But sometimes you don't care and you just write plan_fft because you don't assume anything about the input type.
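For context on why rfft is the recommended route: the spectrum of a real signal is conjugate-symmetric, so rfft only computes the non-redundant half. A minimal CPU sketch with NumPy (not CUFFT, just illustrating the equivalence):

```python
import numpy as np

# Real-valued input
x = np.random.rand(8).astype(np.float32)

# Full complex FFT: length-8 complex output
full = np.fft.fft(x)

# Real FFT: only the first n//2 + 1 bins; the remaining bins are
# redundant (conjugate-symmetric) because the input is real
half = np.fft.rfft(x)

# rfft matches the first half of the full transform
assert np.allclose(half, full[: len(x) // 2 + 1], atol=1e-5)
```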
julia> using CUDA, CUDA.CUFFT, BenchmarkTools
julia> function ff(sz)
           xc = CUDA.rand(Float32, sz...)
           p = plan_fft(xc, (1,2))
           @benchmark CUDA.@sync $p * $xc
       end
ff (generic function with 1 method)
julia> ff((256, 256, 256))
BenchmarkTools.Trial: 1577 samples with 1 evaluation.
Range (min … max): 3.132 ms … 24.120 ms ┊ GC (min … max): 0.00% … 98.37%
Time (median): 3.152 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.167 ms ± 528.090 μs ┊ GC (mean ± σ): 1.17% ± 3.55%
▁▁▂▄▆▆▄▅▅▆█▂▄▃▂▄ ▂▂▃ ▁ ▁
▁▁▁▂▁▂▂▂▂▂▄▄▄▆▆▇██████████████████████▇███▇▆▅▅▅▅▄▂▃▃▃▂▂▃▂▂▂ ▄
3.13 ms Histogram: frequency by time 3.17 ms <
Memory estimate: 4.09 KiB, allocs estimate: 161.
julia> ff((150, 150, 150))
BenchmarkTools.Trial: 7474 samples with 1 evaluation.
Range (min … max): 654.434 μs … 37.787 ms ┊ GC (min … max): 0.00% … 98.53%
Time (median): 659.324 μs ┊ GC (median): 0.00%
Time (mean ± σ): 666.257 μs ± 429.687 μs ┊ GC (mean ± σ): 1.40% ± 4.80%
▃▇█▇▄ ▁ ▁
█████████▅▅▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▄▆▆▆ █
654 μs Histogram: log(frequency) by time 774 μs <
Memory estimate: 4.09 KiB, allocs estimate: 161.
In [26]: xc = torch.rand(256, 256, 256).cuda()
In [27]: %%timeit
...: # xc = torch.rand(256, 256, 256).cuda()
...: torch.fft.fft2(xc)
...: torch.cuda.synchronize()
...:
...:
2.23 ms ± 783 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [28]: xc = torch.rand(150, 150, 150).cuda()
In [29]: %%timeit
...: # xc = torch.rand(256, 256, 256).cuda()
...: torch.fft.fft2(xc)
...: torch.cuda.synchronize()
...:
...:
466 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Input complex:
julia> function ff(sz)
           xc = CUDA.rand(ComplexF32, sz...)
           p = plan_fft(xc, (1,2))
           @benchmark CUDA.@sync $p * $xc
       end
ff (generic function with 1 method)
julia> ff((150, 150, 150))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 345.697 μs … 46.356 ms ┊ GC (min … max): 0.00% … 98.87%
Time (median): 347.790 μs ┊ GC (median): 0.00%
Time (mean ± σ): 353.966 μs ± 460.525 μs ┊ GC (mean ± σ): 1.65% ± 3.04%
▁▄▆▇█▇▅▅▄▃
▂▂▂▃▄▅▇███████████▇▆▅▄▄▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▁▁▁▂ ▄
346 μs Histogram: frequency by time 356 μs <
Memory estimate: 672 bytes, allocs estimate: 21.
In [30]: xc = torch.rand(150, 150, 150).cuda() + 0j
In [31]: %%timeit
...: # xc = torch.rand(256, 256, 256).cuda()
...: torch.fft.fft2(xc)
...: torch.cuda.synchronize()
...:
...:
343 µs ± 168 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
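If you want to keep a generic plan_fft code path but avoid the real-input penalty, one workaround (a sketch of the idea, not a tested CUDA.jl recipe) is to convert the array to complex once, outside the hot loop, since the forward transform of the complexified array is identical. Illustrated on the CPU with NumPy; in Julia the analogous step would be something like ComplexF32.(xc) before planning:

```python
import numpy as np

# Stand-in for the GPU array from the benchmarks above
x = np.random.rand(4, 4, 4).astype(np.float32)

# Convert to complex once, up front
xc = x.astype(np.complex64)

# The forward FFT over the first two axes is the same either way
f_real = np.fft.fftn(x, axes=(0, 1))
f_cplx = np.fft.fftn(xc, axes=(0, 1))
assert np.allclose(f_real, f_cplx, atol=1e-4)
```

The one-time conversion costs a copy and doubles the memory for that array, but every subsequent p * xc then hits the fast complex-to-complex path.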
Versions: