Open doraemonho opened 1 year ago
Hi @doraemonho — Nice! I’ll try to have a look at it soon.
Inside the GetFk function the GPUgrid is called instead the grid from the function’s argument. Is that the prob?
Thanks for taking a look!
Inside the GetFk function the GPUgrid is called instead the grid from the function’s argument. Is that the prob?
I rewrtie the function a bit to define the GPUgrid inside the function (showed below), but the result doesn't change.
function GetFk(N::Int,Lx::Number;T = Float32, dev = GPU(), kf = 2,σ²= 1)
Fgrid = ThreeDGrid(GPU(),N,Lx;T=T);
kx,ky,kz = Fgrid.kr,Fgrid.l,Fgrid.m;
k⁻¹ = @. √(Fgrid.invKrsq);
k = @. √(Fgrid.Krsq);
Fk = @. √(exp(-(k.-kf)^2/σ²)/2/π)*k⁻¹;
randN = typeof(Fk) <: Array ? Base.rand : CUDA.rand;
eⁱᶿ¹ = exp.(im*randN(T,Fgrid.nkr,Fgrid.nl,Fgrid.nm)*2*π);
return Fk.*eⁱᶿ¹,Fgrid;
end
T = Float32;
N = 128
Lx= 2π
GFkh,GPUgrid = GetFk(N,Lx;T=T,kf = 8,σ²= 1);
GFk = CUDA.zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
ldiv!(GFk, GPUgrid.rfftplan, deepcopy(GFkh));
CPUgrid = ThreeDGrid(CPU(),N,Lx;T=T);
CFkh = Array(GFkh);
CFk = zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
ldiv!(CFk, CPUgrid.rfftplan, deepcopy(CFkh));
figure(figsize = (16,8))
subplot(121);title("GPU T =$T",size=16)
imshow(Array(GFk[:,:,1]));
subplot(122);title("CPU T =$T",size=16)
imshow(Array(CFk[:,:,1]))
I’ll have a closer look in the morning from my laptop
It occurs to me that, the root of this problem is coming from the fft plan itself. Now, I don't even use grid fft plan itself but define a rfft plan using FFTW, I can reproduce the polluted result
T = Float32;
GFkh_32,GPUgrid = GetFk(N,Lx;T=T,kf = 8,σ²= 1);
GFk32 = CUDA.zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
GPUrfftplan_F32 = plan_rfft(GFk32);
ldiv!(GFk32, GPUrfftplan_F32, deepcopy(GFkh_32));
figure(figsize = (16,8))
subplot(121);title("GPU T =$T",size=16)
imshow(Array(GFk32[:,:,1]));
T = Float64;
GFkh_64,GPUgrid = GetFk(N,Lx;T=T,kf = 8,σ²= 1);
GFk64 = CUDA.zeros(T,GPUgrid.nx,GPUgrid.ny,GPUgrid.nz);
GPUrfftplan_F64 = plan_rfft(GFk64);
ldiv!(GFk64, GPUrfftplan_F64, deepcopy(GFkh_64));
subplot(122);title("GPU T =$T",size=16)
imshow(Array(GFk64[:,:,1]))
I am not sure if this is a FFTW.jl 's or CUDA.jl 's problem.
Quick Update here After a long trial and error, it seems that I am finally arriving to the root of problem. It may has something to do with the fft plan of 2^n 3D real transform. If you try switch the resolution of any 2^n 3D Map (128^3 in our case) to other even number (130^3). The discrepancy between CPU and GPU rfft would vanish. I suspect there may have a bug on the algorithm of implementing 2^n FP32 rfft but I am not sure if this is a FFTW problem or CUDA problem
Can we make sure it's not a plotting issue?
I tried to do some FFTs back and forth and couldn't find any disagreement between different data types. Have a look:
Here's an example just with FFTW and CUDA
```julia using CUDA using FFTW using LinearAlgebra: mul!, ldiv! using Test for T in [Float64, Float32] nx, ny, nz = 6, 8, 10 nk, nl, nm = Int(nx//2)+1, ny, nz a = randn(T, (nx, ny, nz)) a_cu = CuArray(a) ah = randn(Complex{T}, (nk, nl, nm)) ah_cu = CuArray(ah) plan = plan_rfft(a) plan_cu = plan_rfft(a_cu) bh = zeros(Complex{T}, (nk, nl, nm)) bh_cu = CuArray(bh) c = zeros(T, (nx, ny, nz)) c_cu = CuArray(c) mul!(bh, plan, a) ldiv!(c, plan, deepcopy(bh)) mul!(bh_cu, plan_cu, a_cu) ldiv!(c_cu, plan_cu, deepcopy(bh_cu)) @testset "CPU test for $T" begin @test bh == rfft(a) @test c == irfft(bh, nx) end @testset "GPU test for $T" begin @test bh_cu == rfft(a_cu) @test c_cu == irfft(bh_cu, nx) end end ```
and here's an example with CUDA + FourierFlows
```julia using CUDA using LinearAlgebra: mul!, ldiv! using FourierFlows using Test for T in [Float64, Float32] nx, ny, nz = 6, 8, 10 grid = ThreeDGrid(nx, nx, ny, ny, nz, nz; T) grid_cu = ThreeDGrid(GPU(), nx, nx, ny, ny, nz, nz; T) a = randn(T, (grid.nx, grid.ny, grid.nz)) a_cu = CuArray(a) ah = randn(Complex{eltype(grid)}, (grid.nkr, grid.nl, grid.nm)) ah_cu = CuArray(ah) plan = grid.rfftplan plan_cu = grid_cu.rfftplan bh = zeros(Complex{eltype(grid)}, (grid.nkr, grid.nl, grid.nm)) bh_cu = CuArray(bh) c = zeros(eltype(grid), (grid.nx, grid.ny, grid.nz)) c_cu = CuArray(c) mul!(bh, plan, a) ldiv!(c, plan, deepcopy(bh)) mul!(bh_cu, plan_cu, a_cu) ldiv!(c_cu, plan_cu, deepcopy(bh_cu)) @testset "CPU test for $T" begin @test bh == rfft(a) @test c == irfft(bh, nx) end @testset "GPU test for $T" begin @test bh_cu == rfft(a_cu) @test c_cu == irfft(bh_cu, nx) end end ```
All test pass for me. Same if I choose nx = ny = nz = 128
.
Can we make sure it's not a plotting issue?
I tried to do some FFTs back and forth and couldn't find any disagreement between different data types. Have a look:
Here's an example just with FFTW and CUDA
CUDA + FFTW and here's an example with CUDA + FourierFlows
CUDA + FourierFlows All test pass for me. Same if I choose
nx = ny = nz = 128
.
Thanks for checking! Same result for me but I think this problem would only occurs if we satisfy the following condition :
I quite certain the problem is not coming the plot as the radial spectrum of FP32-CUDA found a spike at large k number.
I cannot check the result using your method as the FFT result is a little different (~1e-18) between GPU and CPU result.
However, I come up a alternative by using \approx
instead of ==
inside the test. The Float64 is fine but fail in Float32.
using CUDA
using FFTW
using LinearAlgebra: mul!, ldiv!
using FourierFlows
using Test
nx, ny, nz = 128, 128, 128; # define size of the array
θ = rand(θ,div(nx,2)+1, ny, nz)*2π; # define \theta \in [0,2\pi]
for T in [Float64,Float32]
grid = ThreeDGrid(nx, nx, ny, ny, nz, nz; T)
grid_cu = ThreeDGrid(GPU(), nx, nx, ny, ny, nz, nz; T)
θT = convert(Array{T,3},θ)
eⁱᶿk⁻² = @. exp.(im*θT)*grid.invKrsq; #define the spectral filtered function in CPU
eⁱᶿk⁻²_cu = CuArray(eⁱᶿk⁻²); #define the spectral filtered function in GPU
plan = grid.rfftplan
plan_cu = grid_cu.rfftplan
c = zeros(eltype(grid), (grid.nx, grid.ny, grid.nz))
c_cu = CuArray(c);
ldiv!(c, plan, deepcopy(eⁱᶿk⁻²))
ldiv!(c_cu, plan_cu, deepcopy(eⁱᶿk⁻²_cu))
c_cu = Array(c_cu); #move the result from GPU-> CPU
test_title = "eⁱᶿk⁻² CPU & GPUtest for " * string(T)
@testset "$test_title" begin
#Compare the result using ≈
@test c_cu ≈ c
end
end
Can you print the types of eⁱᶿk⁻²
and eⁱᶿk⁻²_cu
to make sure they are correct type?
Before and after the conversion to Array for eⁱᶿk⁻²_cu
Before and after the conversion to Array for
eⁱᶿk⁻²_cu
do you mean before/after the rift or before the testing process? This is my type checking code and the result, as you can see, FP64 is fine but FP32 just fail
using CUDA
using FFTW
using LinearAlgebra: mul!, ldiv!
using FourierFlows
using Test
nx, ny, nz = 128, 128, 128; # define size of the array
θ = rand(div(nx,2)+1, ny, nz)*2π; # define \theta \in [0,2\pi]
for T in [Float64,Float32]
grid = ThreeDGrid(nx, nx, ny, ny, nz, nz; T)
grid_cu = ThreeDGrid(GPU(), nx, nx, ny, ny, nz, nz; T)
θT = convert(Array{T,3},θ)
eⁱᶿk⁻² = @. exp.(im*θT)*grid.invKrsq; #define the spectral filtered function in CPU
eⁱᶿk⁻²_cu = CuArray(eⁱᶿk⁻²); #define the spectral filtered function in GPU
println("before the rfft")
@show typeof(eⁱᶿk⁻²)
@show typeof(eⁱᶿk⁻²_cu)
plan = grid.rfftplan
plan_cu = grid_cu.rfftplan
c = zeros(eltype(grid), (grid.nx, grid.ny, grid.nz))
c_cu = CuArray(c);
ldiv!(c, plan, deepcopy(eⁱᶿk⁻²))
ldiv!(c_cu, plan_cu, deepcopy(eⁱᶿk⁻²_cu))
c_cu = Array(c_cu); #move the result from GPU-> CPU
println("before the test")
@show typeof(c)
@show typeof(c_cu)
test_title = "eⁱᶿk⁻² CPU & GPU test for " * string(T)
@testset "$test_title" begin
#Compare the result using ≈
@test c_cu ≈ c
end
end
@doraemonho sorry for slacking... I went out on leave and now slowly coming back.
Did you have any progress on this or is it still outstanding?
@doraemonho sorry for slacking... I went out on leave and now slowly coming back.
Did you have any progress on this or is it still outstanding?
@navidcy no worries, I still have this issue. The way of circumventing it is to change the resolution from 2^n to 10*n or switch to FP64 calculation. The problem only occurs in this kind of function
Just an update from what we were : -) This may be a CuFFT bug that would occur after the version CuFFT v.10.3. I reported it to NVIDIA and just heard back from them. They confirmed the issue and their CuFFT team is taking a look into it right now. Not sure when will be solved but I will keep updating when I get back from them
@doraemonho thanks for this!
Hi there! I'm working on cuFFT and following up from https://github.com/JuliaGPU/CUDA.jl/issues/1656 (internal NVIDIA bug# 3847913).
Sorry to hijack your issues - I thought it would be easier and faster to chat here directly. Let me know if you'd prefer to chat somewhere else (either https://github.com/JuliaGPU/CUDA.jl/issues/1656 or in the bug ticket).
Do you have more information about what exactly is the input to the complex-to-real transform that is being done by cuFFT ?
The reason why this matters is the following.
When doing a DFT of length N with a real input, if x
is the output and N
is even:
x[0]
is real;x[N/2]
is real;x[i] = x[N-i]*
(https://en.wikipedia.org/wiki/Discrete_Fourier_transform#DFT_of_real_and_purely_imaginary_signals)Given this, in practice, real-to-complex transforms work on N
real inputs and produce N/2+1
complex outputs. The complex output is such that the first two properties above are respected.
Complex-to-real transforms take N/2+1
complex input (y
) and produce N
real output and assume that
y[0]
is real;y[N/2]
is real
in memory.One might assume that imag(y[0])
or imag(y[N/2])
are not even read by the kernels.
But in some (very few) case it is, because it can improve performance. This is usually not a problem because the input to a complex-to-real FFT is usually the output of a real-to-complex FFT, which satisfies this property.
What I suspect is that your input doesn't satisfy one of the above properties. I've done some profiling of the Julia code and in the nx=128, FP32 case, I see GPU kernels which assume the above properties.
In https://github.com/JuliaGPU/CUDA.jl/issues/1656 you mentionned that
I've attached a cpp repro that shows this: when the last entry of the N/2+1 complex input to the complex-to-real FFT is not real (i.e., the imaginary part is non zero), in some cases (powers of 2 with enough batches on new-enough GPUs), the output will be wrong. If you compile and run it, for instance with CUDA 11.8, you will see
Symmetry error: 2.402546e-07
Error C2R output vs C2C output (with mistake ? 0): 1.307564e-07
Last value of input_c2r set to {6.389053,5.000000} instead of {6.389053, 0}
Error C2R output vs C2C output (with mistake ? 1): 1.566945e-02
I realize this is a subtle issue. It might also have nothing to do with this, but I thought this was worth a shot.
Let me know if this helps,
Hi there! I'm working on cuFFT and following up from [JuliaGPU/CUDA.jl#1656] .....
Hi, thanks for checking it out! As the problem started from this thread, I think we may better chat here?
Do you have more information about what exactly is the input to the complex-to-real transform that is being done by cuFFT?
The input is just a random phase function $e^{i\phi}k\^{-2}$. The first term is just Euler's formula with some random phase angle bounded between 0 and $2\pi$, and $k\^{-2}$ can be referred to $(\sqrt{3}\times N\times x_i )^{-2}$, where $x_i$ is discrete coordinate number bound between 0 and 1. This is one classical way of injecting turbulence energy in fluid simulation.
We always set $k\^{-2} =0$ when $x_i = 0$ to avoid singularity but other terms are nonzeros and remain imaginary. So I think
y[0] = 0
doesn't hold in our case. I work out a test of setting $y[0] = 0$ and it works! (Code showed at bottom)
I have one more question to ask. I was trying CuFFT v.10.3.0
for the same setup but the result was fine. However, the symptom only occurs in the newer version of CuFFT
library. Are complex to real transform algorithm changes during the recent version?
using CUDA
using FFTW
using Random
using LinearAlgebra: mul!, ldiv!
using Test
function getk⁻²_and_rfftplan(dev::Dev;
nx = 64, ny = nx, nz = nx, Lx = 2π, Ly = Lx, Lz = Lx,
T = Float32, nthreads=Sys.CPU_THREADS, effort=FFTW.MEASURE) where Dev
ArrayType = dev == "CPU" ? Array : CuArray;
# Define the rfft parameters
nk = nx
nl = ny
nm = nz
nkr = Int(nx/2 + 1)
# Wavenubmer
k = ArrayType{T}(reshape( fftfreq(nx, 2π/Lx*nx), (nk, 1, 1)))
l = ArrayType{T}(reshape( fftfreq(ny, 2π/Ly*ny), (1, nl, 1)))
m = ArrayType{T}(reshape( fftfreq(nz, 2π/Lz*nz), (1, 1, nm)))
kr = ArrayType{T}(reshape(rfftfreq(nx, 2π/Lx*nx), (nkr, 1, 1)))
Ksq = @. k^2 + l^2 + m^2
invKsq = @. 1 / Ksq
CUDA.@allowscalar invKsq[1, 1, 1] = 0;
Krsq = @. kr^2 + l^2 + m^2
invKrsq = @. 1 / Krsq
CUDA.@allowscalar invKrsq[1, 1, 1] = 0;
FFTW.set_num_threads(nthreads);
rfftplan = plan_rfft(ArrayType{T, 3}(undef, nx, ny, nz))
return invKrsq,rfftplan
end
# Set up the initial condition
Random.seed!(1234); # Set up the random seed
θ = rand(div(nx,2)+1, ny, nz)*2π; # define \theta \in [0,2\pi]
nx = ny = nz= 128
T = Float32
k⁻², rfftplan = getk⁻²_and_rfftplan("CPU"; nx = nx, Lx = 2π, ny = ny, nz = ny, T = T)
k⁻²_c,rfftplan_c = getk⁻²_and_rfftplan("GPU"; nx = nx, Lx = 2π, ny = ny, nz = ny, T = T)
θT = im.*convert(Array{T,3},θ)
eⁱᶿk⁻² = @. exp.(θT)*k⁻²; #define the spectral filtered function in CPU
eⁱᶿk⁻²_cu = CuArray(eⁱᶿk⁻²); #define the spectral filtered function in GPU
c_b4_cleaning = CUDA.zeros(T, nx, ny, nz);
c_af_cleaning = CUDA.zeros(T, nx, ny, nz);
# Work out the rfft before cleaning 0 terms
ldiv!(c_b4_cleaning, rfftplan_c, deepcopy(eⁱᶿk⁻²_cu));
#Clean y[N/2] terms (index 0 = 1 julia)
@. eⁱᶿk⁻²_cu[1,:,:] .= 0;
ldiv!(c_af_cleaning, rfftplan_c, deepcopy(eⁱᶿk⁻²_cu));
c_b4_cleaning = Array(c_b4_cleaning); #move the result from GPU-> CPU
c_af_cleaning = Array(c_af_cleaning); #move the result from GPU-> CPU
figure(figsize=(12,6))
subplot("121");
imshow(c_b4_cleaning[:,:,100]);title("Before cleaning y[1,:,:]")
subplot("122");
imshow(c_af_cleaning[:,:,100]);title("After cleaning y[1,:,:]")
Are complex to real transform algorithm changes during the recent version?
Yes, there are occasional changes. I doesn't surprise me that you didn't see this behavior in older versions.
I work out a test of setting and it works! (Code showed at bottom)
Awesome!
If that solves your problem, we'd appreciate if you can follow up on the Bug ticket.
In think in summary
Yes, there are occasional changes. I doesn't surprise me that you didn't see this behavior in older versions.
Thanks for the information! I appreciate your help. I would follow the ticket then.
@navidcy would you think we should keep this issue open or just close it?
Let’s keep it open for now so that people can see that there is an issue. Perhaps we should issue a warning if someone makes a ThreeDGrid w Float32 on GPU. The issue is only for 3D, right?
Let’s keep it open for now so that people can see that there is an issue. Perhaps we should issue a warning if someone makes a ThreeDGrid w Float32 on GPU. The issue is only for 3D, right?
From the description, it seems to apply to all cases(1D/2D/3D) as long as the Fourier transformed $x{1}$ and $x{N/2+1}$ is not 0 before they do the Complex to Real transform.
This applies to 1D, 2D and 3D yes.
If the imaginary part of x[0]
and x[N/2+1]
is not zero, computing a complex-to-real DFT doesn't really make sense since the output of the DFT isn't guaranteed to be real. If you still want to compute the DFT of such signal, I would suggest to do a complex-to-complex DFT. But the output might not be real.
FYI, this is also documented in section 2.3 of cuFFT's online documentation
https://docs.nvidia.com/cuda/cufft/index.html#fft-types
I'm confused a bit.
If we start with a real signal and Fourier transform it then the components that correspond to 0-th frequency and to $N/2+1$-frequency aren't they real?
$ julia --project
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.8.2 (2022-09-29)
_/ |\__'_|_|_|\__'_| |
|__/ |
julia> using LinearAlgebra
julia> using CUDA
julia> using CUDA.CUFFT
julia> a = cu(rand(8))
8-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.4507753
0.06592999
0.96525353
0.6843841
0.94825345
0.046085697
0.38072583
0.8913641
julia> rfft(a)
5-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
4.4327717f0 + 0.0f0im
-0.33708918f0 - 0.4522028f0im
0.053049427f0 + 1.4637324f0im
-0.65786713f0 + 0.71685266f0im
1.0572441f0 + 0.0f0im
julia> ah =complex(cu(zeros(5)))
5-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
0.0f0 + 0.0f0im
0.0f0 + 0.0f0im
0.0f0 + 0.0f0im
0.0f0 + 0.0f0im
0.0f0 + 0.0f0im
julia> rfftplan = plan_rfft(CuArray{Float32, 1}(undef, 8))
CUFFT real-to-complex forward plan for 8-element CuArray of Float32
julia> mul!(ah, rfftplan, a)
5-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:
4.4327717f0 + 0.0f0im
-0.33708918f0 - 0.4522028f0im
0.053049427f0 + 1.4637324f0im
-0.65786713f0 + 0.71685266f0im
1.0572441f0 + 0.0f0im
julia> b = deepcopy(a*0)
8-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
julia> ldiv!(b, rfftplan, ah)
8-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.45077527
0.06593
0.9652535
0.68438405
0.9482534
0.046085723
0.38072583
0.89136404
(same results in 2D and 3D...)
OK, I see... the issue comes when we generate the ah
and not take ah
as the rFFT of a real array. That's perfectly reasonable and indeed it's something expected.
But why that's not an issue with Float64?
Because cuFFT FP64 transforms use (slightly) different algorithms, so the issue is not visible.
I guess it seems that a warning should at least be displayed when user does give an N/2-Fourier component with non-zero imaginary part? I guess by CUDA.jl?
Hi, I am working on the 3D RFFT but find out the result of rfft using CUDA is a bit weird in Float32 .
I define a function
F(k)
in k-space as a gaussian function likee^{-(k-k0)^2}e^{-i\phi}
, where k0 is some number and phi is a random number array between 0 and 2pi. Here is the implementationIf I do a rfft on this function, the result of CUDA rfft will overlay a moiré like pattern like below:![image](https://user-images.githubusercontent.com/20443810/179374024-528fe9d5-3f53-4431-b959-9b5671eb038f.png)
Here is the implemention:
However, if you change T from Float32 to Float64, the result would become normal again:![image](https://user-images.githubusercontent.com/20443810/179374303-ede11059-8cb4-4b30-9903-daf5394bf3c9.png)
The problem itself is rare and would happen in this kind of function.