Polluted result using CUDA 3D rfft in Float32 data type

doraemonho commented 1 year ago

Hi, I am working on the 3D RFFT but find out the result of rfft using CUDA is a bit weird in Float32 .

I define a function F(k) in k-space as a gaussian function like e^{-(k-k0)^2}e^{-i\phi}, where k0 is some number and phi is a random number array between 0 and 2pi. Here is the implementation

function GetFk(grid;kf = 2,σ²= 1)
    kx,ky,kz = grid.kr,grid.l,grid.m;
    k⁻¹  = @. √(grid.invKrsq);
    k    = @. √(grid.Krsq);
    Fk   = @. √(exp(-(k.-kf)^2/σ²)/2/π)*k⁻¹;
    randN = typeof(Fk) <: Array ? Base.rand : CUDA.rand;
    eⁱᶿ¹ = exp.(im*randN(T,GPUgrid.nkr,GPUgrid.nl,GPUgrid.nm)*2*π);
    return Fk.*eⁱᶿ¹;
end

If I do a rfft on this function, the result of CUDA rfft will overlay a moiré like pattern like below:

Here is the implemention:

using PyPlot,CUDA, FourierFlows
using FFTW
using LinearAlgebra: mul!, ldiv!

T = Float32;
GPUgrid = ThreeDGrid(GPU(),128,2π;T=T);
GFkh = GetFk(GPUgrid;kf = 2,σ²= 1);

GFk  = CUDA.zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
ldiv!(GFk, GPUgrid.rfftplan, deepcopy(GFkh));

CPUgrid = ThreeDGrid(CPU(),128,2π;T=T);
CFkh = Array(GFkh);
CFk  = zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
ldiv!(CFk, CPUgrid.rfftplan, deepcopy(CFkh));

figure(figsize = (16,8))
subplot(121);title("GPU T =$T",size=16)
imshow(Array(GFk[:,:,1]));
subplot(122);title("CPU T =$T",size=16)
imshow(Array(CFk[:,:,1]))

However, if you change T from Float32 to Float64, the result would become normal again:

The problem itself is rare and would happen in this kind of function.

navidcy commented 1 year ago

Hi @doraemonho — Nice! I’ll try to have a look at it soon.

navidcy commented 1 year ago

Inside the GetFk function the GPUgrid is called instead the grid from the function’s argument. Is that the prob?

doraemonho commented 1 year ago

Thanks for taking a look!

Inside the GetFk function the GPUgrid is called instead the grid from the function’s argument. Is that the prob?

I rewrtie the function a bit to define the GPUgrid inside the function (showed below), but the result doesn't change.

function GetFk(N::Int,Lx::Number;T = Float32, dev = GPU(), kf = 2,σ²= 1)
    Fgrid    = ThreeDGrid(GPU(),N,Lx;T=T);
    kx,ky,kz = Fgrid.kr,Fgrid.l,Fgrid.m;
    k⁻¹  = @. √(Fgrid.invKrsq);
    k    = @. √(Fgrid.Krsq);
    Fk   = @. √(exp(-(k.-kf)^2/σ²)/2/π)*k⁻¹;
    randN = typeof(Fk) <: Array ? Base.rand : CUDA.rand;
    eⁱᶿ¹ = exp.(im*randN(T,Fgrid.nkr,Fgrid.nl,Fgrid.nm)*2*π);
    return Fk.*eⁱᶿ¹,Fgrid;
end

T = Float32;
N = 128
Lx= 2π
GFkh,GPUgrid = GetFk(N,Lx;T=T,kf = 8,σ²= 1);

GFk  = CUDA.zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
ldiv!(GFk, GPUgrid.rfftplan, deepcopy(GFkh));

CPUgrid = ThreeDGrid(CPU(),N,Lx;T=T);
CFkh = Array(GFkh);
CFk  = zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
ldiv!(CFk, CPUgrid.rfftplan, deepcopy(CFkh));

figure(figsize = (16,8))
subplot(121);title("GPU T =$T",size=16)
imshow(Array(GFk[:,:,1]));
subplot(122);title("CPU T =$T",size=16)
imshow(Array(CFk[:,:,1]))

navidcy commented 1 year ago

I’ll have a closer look in the morning from my laptop

doraemonho commented 1 year ago

It occurs to me that, the root of this problem is coming from the fft plan itself. Now, I don't even use grid fft plan itself but define a rfft plan using FFTW, I can reproduce the polluted result

T = Float32;
GFkh_32,GPUgrid = GetFk(N,Lx;T=T,kf = 8,σ²= 1);
GFk32  = CUDA.zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
GPUrfftplan_F32 = plan_rfft(GFk32);
ldiv!(GFk32, GPUrfftplan_F32, deepcopy(GFkh_32));

figure(figsize = (16,8))
subplot(121);title("GPU T =$T",size=16)
imshow(Array(GFk32[:,:,1]));

T = Float64;
GFkh_64,GPUgrid = GetFk(N,Lx;T=T,kf = 8,σ²= 1);
GFk64  = CUDA.zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
GPUrfftplan_F64 = plan_rfft(GFk64);
ldiv!(GFk64, GPUrfftplan_F64, deepcopy(GFkh_64));

subplot(122);title("GPU T =$T",size=16)
imshow(Array(GFk64[:,:,1]))

I am not sure if this is a FFTW.jl 's or CUDA.jl 's problem.

doraemonho commented 1 year ago

Quick Update here After a long trial and error, it seems that I am finally arriving to the root of problem. It may has something to do with the fft plan of 2^n 3D real transform. If you try switch the resolution of any 2^n 3D Map (128^3 in our case) to other even number (130^3). The discrepancy between CPU and GPU rfft would vanish. I suspect there may have a bug on the algorithm of implementing 2^n FP32 rfft but I am not sure if this is a FFTW problem or CUDA problem

navidcy commented 1 year ago

Can we make sure it's not a plotting issue?

I tried to do some FFTs back and forth and couldn't find any disagreement between different data types. Have a look:

Here's an example just with FFTW and CUDA

CUDA + FFTW

```julia using CUDA using FFTW using LinearAlgebra: mul!, ldiv! using Test for T in [Float64, Float32] nx, ny, nz = 6, 8, 10 nk, nl, nm = Int(nx//2)+1, ny, nz a = randn(T, (nx, ny, nz)) a_cu = CuArray(a) ah = randn(Complex{T}, (nk, nl, nm)) ah_cu = CuArray(ah) plan = plan_rfft(a) plan_cu = plan_rfft(a_cu) bh = zeros(Complex{T}, (nk, nl, nm)) bh_cu = CuArray(bh) c = zeros(T, (nx, ny, nz)) c_cu = CuArray(c) mul!(bh, plan, a) ldiv!(c, plan, deepcopy(bh)) mul!(bh_cu, plan_cu, a_cu) ldiv!(c_cu, plan_cu, deepcopy(bh_cu)) @testset "CPU test for $T" begin @test bh == rfft(a) @test c == irfft(bh, nx) end @testset "GPU test for $T" begin @test bh_cu == rfft(a_cu) @test c_cu == irfft(bh_cu, nx) end end ```

and here's an example with CUDA + FourierFlows

CUDA + FourierFlows

```julia using CUDA using LinearAlgebra: mul!, ldiv! using FourierFlows using Test for T in [Float64, Float32] nx, ny, nz = 6, 8, 10 grid = ThreeDGrid(nx, nx, ny, ny, nz, nz; T) grid_cu = ThreeDGrid(GPU(), nx, nx, ny, ny, nz, nz; T) a = randn(T, (grid.nx, grid.ny, grid.nz)) a_cu = CuArray(a) ah = randn(Complex{eltype(grid)}, (grid.nkr, grid.nl, grid.nm)) ah_cu = CuArray(ah) plan = grid.rfftplan plan_cu = grid_cu.rfftplan bh = zeros(Complex{eltype(grid)}, (grid.nkr, grid.nl, grid.nm)) bh_cu = CuArray(bh) c = zeros(eltype(grid), (grid.nx, grid.ny, grid.nz)) c_cu = CuArray(c) mul!(bh, plan, a) ldiv!(c, plan, deepcopy(bh)) mul!(bh_cu, plan_cu, a_cu) ldiv!(c_cu, plan_cu, deepcopy(bh_cu)) @testset "CPU test for $T" begin @test bh == rfft(a) @test c == irfft(bh, nx) end @testset "GPU test for $T" begin @test bh_cu == rfft(a_cu) @test c_cu == irfft(bh_cu, nx) end end ```

All test pass for me. Same if I choose nx = ny = nz = 128.

doraemonho commented 1 year ago

Can we make sure it's not a plotting issue?

I tried to do some FFTs back and forth and couldn't find any disagreement between different data types. Have a look:

Here's an example just with FFTW and CUDA

CUDA + FFTW and here's an example with CUDA + FourierFlows

CUDA + FourierFlows All test pass for me. Same if I choose nx = ny = nz = 128.

Thanks for checking! Same result for me but I think this problem would only occurs if we satisfy the following condition :

Doing rfft from k-space to real space
The function that we are transferring is special ( like the spectral filtered e^{i\phi} k^{-2} )

I quite certain the problem is not coming the plot as the radial spectrum of FP32-CUDA found a spike at large k number. I cannot check the result using your method as the FFT result is a little different (~1e-18) between GPU and CPU result. However, I come up a alternative by using \approx instead of == inside the test. The Float64 is fine but fail in Float32.

using CUDA
using FFTW
using LinearAlgebra: mul!, ldiv!
using FourierFlows
using Test

nx, ny, nz = 128, 128, 128;       # define size of the array
θ = rand(θ,div(nx,2)+1, ny, nz)*2π; # define \theta \in [0,2\pi]

for T in [Float64,Float32]
    grid = ThreeDGrid(nx, nx, ny, ny, nz, nz; T)             
    grid_cu = ThreeDGrid(GPU(), nx, nx, ny, ny, nz, nz; T)

    θT  = convert(Array{T,3},θ)
    eⁱᶿk⁻² = @. exp.(im*θT)*grid.invKrsq;     #define the spectral filtered function in CPU
    eⁱᶿk⁻²_cu = CuArray(eⁱᶿk⁻²);               #define the spectral filtered function in GPU

    plan = grid.rfftplan
    plan_cu = grid_cu.rfftplan

    c = zeros(eltype(grid), (grid.nx, grid.ny, grid.nz))
    c_cu = CuArray(c);

    ldiv!(c, plan, deepcopy(eⁱᶿk⁻²))    
    ldiv!(c_cu, plan_cu, deepcopy(eⁱᶿk⁻²_cu))
    c_cu = Array(c_cu); #move the result from GPU-> CPU

    test_title = "eⁱᶿk⁻² CPU & GPUtest for " * string(T)

    @testset "$test_title" begin
        #Compare the result using ≈
        @test c_cu  ≈  c
    end 
end

navidcy commented 1 year ago

Can you print the types of eⁱᶿk⁻² and eⁱᶿk⁻²_cu to make sure they are correct type?

navidcy commented 1 year ago

Before and after the conversion to Array for eⁱᶿk⁻²_cu

doraemonho commented 1 year ago

Before and after the conversion to Array for eⁱᶿk⁻²_cu

do you mean before/after the rift or before the testing process? This is my type checking code and the result, as you can see, FP64 is fine but FP32 just fail

using CUDA
using FFTW
using LinearAlgebra: mul!, ldiv!
using FourierFlows
using Test

nx, ny, nz = 128, 128, 128;       # define size of the array
θ = rand(div(nx,2)+1, ny, nz)*2π; # define \theta \in [0,2\pi]

for T in [Float64,Float32]
    grid = ThreeDGrid(nx, nx, ny, ny, nz, nz; T)             
    grid_cu = ThreeDGrid(GPU(), nx, nx, ny, ny, nz, nz; T)

    θT  = convert(Array{T,3},θ)
    eⁱᶿk⁻² = @. exp.(im*θT)*grid.invKrsq;     #define the spectral filtered function in CPU
    eⁱᶿk⁻²_cu = CuArray(eⁱᶿk⁻²);              #define the spectral filtered function in GPU

    println("before the rfft")
    @show typeof(eⁱᶿk⁻²)
    @show typeof(eⁱᶿk⁻²_cu)

    plan = grid.rfftplan
    plan_cu = grid_cu.rfftplan

    c = zeros(eltype(grid), (grid.nx, grid.ny, grid.nz))
    c_cu = CuArray(c);

    ldiv!(c, plan, deepcopy(eⁱᶿk⁻²))    
    ldiv!(c_cu, plan_cu, deepcopy(eⁱᶿk⁻²_cu))
    c_cu = Array(c_cu); #move the result from GPU-> CPU

    println("before the test")
    @show typeof(c)
    @show typeof(c_cu)

    test_title = "eⁱᶿk⁻² CPU & GPU test for " * string(T)

    @testset "$test_title" begin
        #Compare the result using ≈
        @test c_cu  ≈  c
    end 
end

navidcy commented 1 year ago

@doraemonho sorry for slacking... I went out on leave and now slowly coming back.

Did you have any progress on this or is it still outstanding?

doraemonho commented 1 year ago

@doraemonho sorry for slacking... I went out on leave and now slowly coming back.

Did you have any progress on this or is it still outstanding?

@navidcy no worries, I still have this issue. The way of circumventing it is to change the resolution from 2^n to 10*n or switch to FP64 calculation. The problem only occurs in this kind of function

doraemonho commented 1 year ago

Just an update from what we were : -) This may be a CuFFT bug that would occur after the version CuFFT v.10.3. I reported it to NVIDIA and just heard back from them. They confirmed the issue and their CuFFT team is taking a look into it right now. Not sure when will be solved but I will keep updating when I get back from them

navidcy commented 1 year ago

@doraemonho thanks for this!

nvlcambier commented 1 year ago

Hi there! I'm working on cuFFT and following up from https://github.com/JuliaGPU/CUDA.jl/issues/1656 (internal NVIDIA bug# 3847913).

Sorry to hijack your issues - I thought it would be easier and faster to chat here directly. Let me know if you'd prefer to chat somewhere else (either https://github.com/JuliaGPU/CUDA.jl/issues/1656 or in the bug ticket).

Do you have more information about what exactly is the input to the complex-to-real transform that is being done by cuFFT ?

The reason why this matters is the following.

When doing a DFT of length N with a real input, if x is the output and N is even:

x[0] is real;
x[N/2] is real;
x[i] = x[N-i]* (https://en.wikipedia.org/wiki/Discrete_Fourier_transform#DFT_of_real_and_purely_imaginary_signals)

Given this, in practice, real-to-complex transforms work on N real inputs and produce N/2+1 complex outputs. The complex output is such that the first two properties above are respected.

Complex-to-real transforms take N/2+1 complex input (y) and produce N real output and assume that

y[0] is real;
y[N/2] is real in memory.

One might assume that imag(y[0]) or imag(y[N/2]) are not even read by the kernels. But in some (very few) case it is, because it can improve performance. This is usually not a problem because the input to a complex-to-real FFT is usually the output of a real-to-complex FFT, which satisfies this property.

What I suspect is that your input doesn't satisfy one of the above properties. I've done some profiling of the Julia code and in the nx=128, FP32 case, I see GPU kernels which assume the above properties.

In https://github.com/JuliaGPU/CUDA.jl/issues/1656 you mentionned that

This doesn't happen for FP64 --> this is because FP64 transforms don't use this algorithm
This doesn't happen for nx=130 --> this is because only powers of 2 use this algorithm

I've attached a cpp repro that shows this: when the last entry of the N/2+1 complex input to the complex-to-real FFT is not real (i.e., the imaginary part is non zero), in some cases (powers of 2 with enough batches on new-enough GPUs), the output will be wrong. If you compile and run it, for instance with CUDA 11.8, you will see

Symmetry error: 2.402546e-07
Error C2R output vs C2C output (with mistake ? 0): 1.307564e-07
Last value of input_c2r set to {6.389053,5.000000} instead of {6.389053, 0}
Error C2R output vs C2C output (with mistake ? 1): 1.566945e-02

I realize this is a subtle issue. It might also have nothing to do with this, but I thought this was worth a shot.

Let me know if this helps,

repro.tar.gz

doraemonho commented 1 year ago

Hi there! I'm working on cuFFT and following up from [JuliaGPU/CUDA.jl#1656] .....

Hi, thanks for checking it out! As the problem started from this thread, I think we may better chat here?

Do you have more information about what exactly is the input to the complex-to-real transform that is being done by cuFFT?

The input is just a random phase function $e^{i\phi}k\^{-2}$. The first term is just Euler's formula with some random phase angle bounded between 0 and $2\pi$, and $k\^{-2}$ can be referred to $(\sqrt{3}\times N\times x_i )^{-2}$, where $x_i$ is discrete coordinate number bound between 0 and 1. This is one classical way of injecting turbulence energy in fluid simulation.

We always set $k\^{-2} =0$ when $x_i = 0$ to avoid singularity but other terms are nonzeros and remain imaginary. So I think y[0] = 0 doesn't hold in our case. I work out a test of setting $y[0] = 0$ and it works! (Code showed at bottom)

I have one more question to ask. I was trying CuFFT v.10.3.0 for the same setup but the result was fine. However, the symptom only occurs in the newer version of CuFFT library. Are complex to real transform algorithm changes during the recent version?

using CUDA
using FFTW
using Random
using LinearAlgebra: mul!, ldiv!
using Test

function getk⁻²_and_rfftplan(dev::Dev; 
                             nx = 64, ny = nx, nz = nx, Lx = 2π, Ly = Lx, Lz = Lx,
                             T = Float32, nthreads=Sys.CPU_THREADS, effort=FFTW.MEASURE) where Dev
    ArrayType = dev == "CPU" ? Array : CuArray;

    # Define the rfft parameters
    nk = nx
    nl = ny
    nm = nz
    nkr = Int(nx/2 + 1)  
    # Wavenubmer
    k = ArrayType{T}(reshape( fftfreq(nx, 2π/Lx*nx), (nk, 1, 1)))
    l = ArrayType{T}(reshape( fftfreq(ny, 2π/Ly*ny), (1, nl, 1)))
    m = ArrayType{T}(reshape( fftfreq(nz, 2π/Lz*nz), (1, 1, nm)))
    kr = ArrayType{T}(reshape(rfftfreq(nx, 2π/Lx*nx), (nkr, 1, 1)))

       Ksq = @. k^2 + l^2 + m^2
    invKsq = @. 1 / Ksq
    CUDA.@allowscalar  invKsq[1, 1, 1] = 0;

       Krsq = @. kr^2 + l^2 + m^2
    invKrsq = @. 1 / Krsq
    CUDA.@allowscalar invKrsq[1, 1, 1] = 0;

    FFTW.set_num_threads(nthreads);
    rfftplan = plan_rfft(ArrayType{T, 3}(undef, nx, ny, nz))

    return invKrsq,rfftplan
end

# Set up the initial condition
Random.seed!(1234);        # Set up the random seed
θ = rand(div(nx,2)+1, ny, nz)*2π; # define \theta \in [0,2\pi]
nx = ny = nz= 128
T = Float32

  k⁻²,  rfftplan = getk⁻²_and_rfftplan("CPU"; nx = nx, Lx = 2π, ny = ny, nz = ny, T =  T)             
k⁻²_c,rfftplan_c = getk⁻²_and_rfftplan("GPU"; nx = nx, Lx = 2π, ny = ny, nz = ny, T =  T)

θT  = im.*convert(Array{T,3},θ)
eⁱᶿk⁻² = @. exp.(θT)*k⁻²;        #define the spectral filtered function in CPU
eⁱᶿk⁻²_cu = CuArray(eⁱᶿk⁻²);     #define the spectral filtered function in GPU

c_b4_cleaning = CUDA.zeros(T, nx, ny, nz);
c_af_cleaning = CUDA.zeros(T, nx, ny, nz);

# Work out the rfft before cleaning 0 terms
ldiv!(c_b4_cleaning, rfftplan_c, deepcopy(eⁱᶿk⁻²_cu));

#Clean y[N/2] terms (index 0 = 1 julia)
@. eⁱᶿk⁻²_cu[1,:,:] .= 0;
ldiv!(c_af_cleaning, rfftplan_c, deepcopy(eⁱᶿk⁻²_cu));

c_b4_cleaning = Array(c_b4_cleaning); #move the result from GPU-> CPU
c_af_cleaning = Array(c_af_cleaning); #move the result from GPU-> CPU

figure(figsize=(12,6))
subplot("121");
imshow(c_b4_cleaning[:,:,100]);title("Before cleaning y[1,:,:]")
subplot("122");
imshow(c_af_cleaning[:,:,100]);title("After cleaning y[1,:,:]")

nvlcambier commented 1 year ago

Are complex to real transform algorithm changes during the recent version?

Yes, there are occasional changes. I doesn't surprise me that you didn't see this behavior in older versions.

I work out a test of setting and it works! (Code showed at bottom)

Awesome!

If that solves your problem, we'd appreciate if you can follow up on the Bug ticket.

In think in summary

If you want to do a complex-to-real transform, ensure the first and last element of the complex input of length (N/2)/1 are real, otherwise the result is essentially undefined;
If you can't have those elements real, I would just do a complex-to-complex with a complex input of length N. But then the output won't necessarily be real.

doraemonho commented 1 year ago

Yes, there are occasional changes. I doesn't surprise me that you didn't see this behavior in older versions.

Thanks for the information! I appreciate your help. I would follow the ticket then.

@navidcy would you think we should keep this issue open or just close it?

navidcy commented 1 year ago

Let’s keep it open for now so that people can see that there is an issue. Perhaps we should issue a warning if someone makes a ThreeDGrid w Float32 on GPU. The issue is only for 3D, right?

doraemonho commented 1 year ago

Let’s keep it open for now so that people can see that there is an issue. Perhaps we should issue a warning if someone makes a ThreeDGrid w Float32 on GPU. The issue is only for 3D, right?

From the description, it seems to apply to all cases(1D/2D/3D) as long as the Fourier transformed $x{1}$ and $x{N/2+1}$ is not 0 before they do the Complex to Real transform.

nvlcambier commented 1 year ago

This applies to 1D, 2D and 3D yes.

If the imaginary part of x[0] and x[N/2+1] is not zero, computing a complex-to-real DFT doesn't really make sense since the output of the DFT isn't guaranteed to be real. If you still want to compute the DFT of such signal, I would suggest to do a complex-to-complex DFT. But the output might not be real.

FYI, this is also documented in section 2.3 of cuFFT's online documentation

https://docs.nvidia.com/cuda/cufft/index.html#fft-types

navidcy commented 1 year ago

I'm confused a bit.

If we start with a real signal and Fourier transform it then the components that correspond to 0-th frequency and to $N/2+1$-frequency aren't they real?

navidcy commented 1 year ago

$ julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.2 (2022-09-29)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> using LinearAlgebra

julia> using CUDA

julia> using CUDA.CUFFT

julia> a = cu(rand(8))
8-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.4507753
 0.06592999
 0.96525353
 0.6843841
 0.94825345
 0.046085697
 0.38072583
 0.8913641

julia> rfft(a)
5-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
   4.4327717f0 + 0.0f0im
 -0.33708918f0 - 0.4522028f0im
 0.053049427f0 + 1.4637324f0im
 -0.65786713f0 + 0.71685266f0im
   1.0572441f0 + 0.0f0im

julia> ah =complex(cu(zeros(5)))
5-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
 0.0f0 + 0.0f0im
 0.0f0 + 0.0f0im
 0.0f0 + 0.0f0im
 0.0f0 + 0.0f0im
 0.0f0 + 0.0f0im

julia> rfftplan = plan_rfft(CuArray{Float32, 1}(undef, 8))
CUFFT real-to-complex forward plan for 8-element CuArray of Float32

julia> mul!(ah, rfftplan, a)
5-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
   4.4327717f0 + 0.0f0im
 -0.33708918f0 - 0.4522028f0im
 0.053049427f0 + 1.4637324f0im
 -0.65786713f0 + 0.71685266f0im
   1.0572441f0 + 0.0f0im

julia> b = deepcopy(a*0)
8-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

julia> ldiv!(b, rfftplan, ah)
8-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.45077527
 0.06593
 0.9652535
 0.68438405
 0.9482534
 0.046085723
 0.38072583
 0.89136404

navidcy commented 1 year ago

(same results in 2D and 3D...)

navidcy commented 1 year ago

OK, I see... the issue comes when we generate the ah and not take ah as the rFFT of a real array. That's perfectly reasonable and indeed it's something expected.

But why that's not an issue with Float64?

nvlcambier commented 1 year ago

Because cuFFT FP64 transforms use (slightly) different algorithms, so the issue is not visible.

navidcy commented 12 months ago

I guess it seems that a warning should at least be displayed when user does give an N/2-Fourier component with non-zero imaginary part? I guess by CUDA.jl?

FourierFlows / FourierFlows.jl

Polluted result using CUDA 3D rfft in Float32 data type #326