FFT on real array with `plan_fft(x)` is 50% slower than PyTorch

Hi,

not a big deal but I noticed that Julia's CUFFT is 50% slower if performing a FFT on a real valued array. If the array is complex, both have same speed. Also rfft seems to work fine which is anyway recommended for a real valued array.

But sometimes you don't care and you just write plan_fft because you don't assume the input type.

julia> using CUDA, CUDA.CUFFT, BenchmarkTools

julia> function ff(sz)
                  xc = CUDA.rand(Float32, sz...)
                  p = plan_fft(xc, (1,2))
                  @benchmark CUDA.@sync $p * $xc
              end
ff (generic function with 1 method)

julia> ff((256, 256, 256))

BenchmarkTools.Trial: 1577 samples with 1 evaluation.
 Range (min … max):  3.132 ms …  24.120 ms  ┊ GC (min … max): 0.00% … 98.37%
 Time  (median):     3.152 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.167 ms ± 528.090 μs  ┊ GC (mean ± σ):  1.17% ±  3.55%

                  ▁▁▂▄▆▆▄▅▅▆█▂▄▃▂▄ ▂▂▃ ▁  ▁                    
  ▁▁▁▂▁▂▂▂▂▂▄▄▄▆▆▇██████████████████████▇███▇▆▅▅▅▅▄▂▃▃▃▂▂▃▂▂▂ ▄
  3.13 ms         Histogram: frequency by time        3.17 ms <

 Memory estimate: 4.09 KiB, allocs estimate: 161.

julia> ff((150, 150, 150))

BenchmarkTools.Trial: 7474 samples with 1 evaluation.
 Range (min … max):  654.434 μs …  37.787 ms  ┊ GC (min … max): 0.00% … 98.53%
 Time  (median):     659.324 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   666.257 μs ± 429.687 μs  ┊ GC (mean ± σ):  1.40% ±  4.80%

  ▃▇█▇▄  ▁                                                      ▁
  █████████▅▅▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▄▆▆▆ █
  654 μs        Histogram: log(frequency) by time        774 μs <

 Memory estimate: 4.09 KiB, allocs estimate: 161.

In [26]: xc = torch.rand(256, 256, 256).cuda()

In [27]: %%timeit
    ...: # xc = torch.rand(256, 256, 256).cuda()
    ...: torch.fft.fft2(xc)
    ...: torch.cuda .synchronize()
    ...: 
    ...: 
2.23 ms ± 783 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [28]: xc = torch.rand(150, 150, 150).cuda()

In [29]: %%timeit
    ...: # xc = torch.rand(256, 256, 256).cuda()
    ...: torch.fft.fft2(xc)
    ...: torch.cuda .synchronize()
    ...: 
    ...: 
466 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Input complex:

julia> function ff(sz)
                  xc = CUDA.rand(ComplexF32, sz...)
                  p = plan_fft(xc, (1,2))
                  @benchmark CUDA.@sync $p * $xc
              end
ff (generic function with 1 method)

julia> ff((150, 150, 150))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  345.697 μs …  46.356 ms  ┊ GC (min … max): 0.00% … 98.87%
 Time  (median):     347.790 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   353.966 μs ± 460.525 μs  ┊ GC (mean ± σ):  1.65% ±  3.04%

         ▁▄▆▇█▇▅▅▄▃                                              
  ▂▂▂▃▄▅▇███████████▇▆▅▄▄▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▁▁▁▂ ▄
  346 μs           Histogram: frequency by time          356 μs <

 Memory estimate: 672 bytes, allocs estimate: 21.

In [30]: xc = torch.rand(150, 150, 150).cuda() + 0j

In [31]: %%timeit
    ...: # xc = torch.rand(256, 256, 256).cuda()
    ...: torch.fft.fft2(xc)
    ...: torch.cuda .synchronize()
    ...: 
    ...: 
343 µs ± 168 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Versions:

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 24 default, 0 interactive, 12 GC (on 24 virtual cores)
Environment:
  JULIA_NUM_THREADS = 24
  JULIA_MAX_NUM_PRECOMPILE_FILES = 100

julia> CUDA.versioninfo()
CUDA runtime 12.5, artifact installation
CUDA driver 12.3
NVIDIA driver 545.23.8

CUDA libraries: 
- CUBLAS: 12.5.2
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.2
- CUSPARSE: 12.4.1
- CUPTI: 23.0.0
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.4.2
- CUDA_Driver_jll: 0.9.0+0
- CUDA_Runtime_jll: 0.14.0+1

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 3060 (sm_86, 5.010 GiB / 12.000 GiB available)

In [32]: torch.__version__
Out[32]: '2.0.1+cu117'

JuliaGPU / CUDA.jl

FFT on real array with `plan_fft(x)` is 50% slower than PyTorch #2419