Closed paquiteau closed 2 years ago
Dear Pierre-Antoine,
Thanks for the reproducible test code. I have not had time to run it. It's surprising. We do stand behind our timings. What GPU do you use? Let's take the 3d 1-coil, where you have 6.1M NU pts, and the fwd transform (type 2): based on your timings,
So, rather than us mis-measuring gpunufft's speed, this seems to be a performance issue with your calling cufinufft that is much worse than expected. Could it be our Python wrapper?
I will see if @MelodyShih or @JBlaschke have any ideas.
Best, Alex
Hi,
It's surprising. We do stand behind our timings. What GPU do you use?
We could reproduces those results on an GTX Quadro M4000, Quadro P400 and The V100 of Google Colab. So the results does not seem to be GPU dependent.
To benchmark the cufiNUFFT with python I based my implementation on the examples provided such as examples/example2d2many.py
. The cufiNUFFT call are very close to machine code I would say whereas the gpuNUFFT is called through the wrapper NonCartesianFFT
Class of pysap-mri.
Moreover, the timing for gpuNUFFT includes H2D, Kernels and D2H. for cufiNUFFT its only H2D and Kernels.
The fact that 2D is faster seems to indicate that the memory transferts are detrimental. Using paged-lock memory may give greater speed.
I think, at this moment, it makes sense to get NVVP profiles for each version to get better clarity. @ahbarnett , is there a version of code we could use for testing which doesn't have pythonic wrapper?
From what I remember, using python wrapper can slow things down in 3D if you are creating memories for it and copying stuff over (D2H ie), but then, I don't think cufiNUFFT does it.
Hi all, I benchmark with the C++ codes directly so I won't rule out yet the python wrapper could be the reason of the timing mismatch here. Also, note that in figure 4 of our paper, we have #NUpts (sampling pts) being 10 times the #U pts, instead of them being in the same order ,so the test case here is slightly different.
Here are the two timing results on a V100 GPU I got this morning. The NUpts are iid uniform random variables across the box.
[yhs264@gv001 cufinufft]$ bin/cufinufft3d1_test_32 2 128 128 128 2097152 1e-3
[time ] dummy warmup call to CUFFT 0.293 s
[time ] cufinufft plan: 0.00117 s
[time ] cufinufft setNUpts: 0.000957 s
[time ] cufinufft exec: 0.0201 s
[time ] cufinufft destroy: 0.000852 s
[time ] cufinufft memcpy: 0.0013 s <--- D2H
[Method 2] 2097152 NU pts to 2097152 U pts in 0.0244 s: 8.6e+07 NU pts/s
(exec-only thoughput: 1.04e+08 NU pts/s)
---> this gives about 11 ns/ nupts, basically same as the speed of gpunufft. I include the D2H timing here. Note that the memory copy time is only 6% of the execution time, it won't affect the speed too much if we further include the H2D timing.
[yhs264@gv001 cufinufft]$ bin/cufinufft3d1_test_32 2 128 128 128 16777216 1e-3
[time ] dummy warmup call to CUFFT 0.282 s
[time ] cufinufft plan: 0.00118 s
[time ] cufinufft setNUpts: 0.00608 s
[time ] cufinufft exec: 0.0412 s
[time ] cufinufft destroy: 0.00096 s
[time ] cufinufft memcpy: 0.00129 s
[Method 2] 16777216 NU pts to 2097152 U pts in 0.0507 s: 3.31e+08 NU pts/s
(exec-only thoughput: 4.07e+08 NU pts/s)
---> this gives about 3 ns/ nupts, which is same as what we reported in Figure 4.
Chaithya, you could find a test code here (type 1, adjoint) and here (type2, forward).
Best regards, Melody
Thank you so much Melody... We will test this and maybe fix the python wrapper if that's slowing us down while using cufinufft.
Hi Folks -- just catching up on this. If the python wrapper is slowing things down, then I would really love to know why. The python wrappers use ctypes
, so my instinct would be to assume very low overhead (or at least fixed overhead -- as we're passing pointers around, so at worst case we're copying a couple more numbers than we would in pure C code.)
@MelodyShih @ahbarnett Do the tests here https://github.com/flatironinstitute/cufinufft/blob/master/python/cufinufft/tests/ already follow https://github.com/flatironinstitute/cufinufft/tree/master/test ? If not I can write the python tests that match the pure C ones and profile them.
I'm sure this was already ruled out, but it might be worth double checking... If the data coming from the python code is not of the correct (complex) dtype, order, and contiguous there could be a deep copy (or worse)... We run into this in ASPIRE where we have float data, for example.
Agreed @garrettwrong -- another issue could be device management / memcpy https://colab.research.google.com/drive/1IuOgwshEG04GVaZoinZNdbRkPmG-Kadp?usp=sharing mostly omits this from the timed loop. (@paquiteau in general I recommend changing your timed
so that it doesn't time the first call).
One thing I notices in @paquiteau 's profiling code is that the gpuNFFT
is timed using:
t,r = timed(lambda: gpuNUFFT.op(img3d))
whereas the cufinufft
code is profiled using:
t,r = timed(lambda: plan.execute(kspace3d_gpu, to_gpu(img3d)))
which includes the to_gpu
HtoD memcopy.
@garrettwrong ASPIRE == https://computationalcryoem.github.io/ASPIRE-Python/index.html ?
I think gpuNUFFT carries out H2D copies (and also D2H) internally. So we include them all.
@chaithyagr does gpuNUFFT
always copy, or does it detect if the data is already on the GPU (Your timed
loop runs 10 iterations, so I wonder if we're comparing [HtoD, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT]
with [HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft]
)
Hi everyone (its very nice to see such will power to solve this issue)
I think gpuNUFFT carries out H2D copies (and also D2H) internally. So we include them all.
Its even worse than that: The data is first copied to pinned memory , then transfered on the gpu:
https://github.com/paquiteau/pygpuNUFFT/blob/master/python/src/python_factory.cpp#L101
and then the copy is done coil wise here:
https://github.com/paquiteau/pygpuNUFFT/blob/master/CUDA/src/gpuNUFFT_operator.cpp#L1178
(@paquiteau in general I recommend changing your timed so that it doesn't time the first call).
@JBlaschke : I have updated the notebook with your suggestion, the results remains the same.
@chaithyagr does gpuNUFFT always copy, or does it detect if the data is already on the GPU
yes, it always copy. (sadly)
Right! I know this is the standard path for GPU code development, so I am not judging. Also @MelodyShih 's benchmark figures include the memcopies.
Hi everyone. THanks Melody for your C++ test verifications. It's weird that the high-density case is a factor nearly 4 different per-NU-point.
Johannes:
@MelodyShih https://github.com/MelodyShih @ahbarnett https://github.com/ahbarnett Do the tests here https://github.com/flatironinstitute/cufinufft/blob/master/python/cufinufft/tests/ already follow https://github.com/flatironinstitute/cufinufft/tree/master/test ? If not I can write the python tests that match the pure C ones and profile them.
No, the Py demos are merely for correctness and were done by Joakim for our paper submission 1 year ago. Any Py-based timing tests, that match a C++-based test like Melody just did, would be amazing to add to the Py tests dir, and would help resolve some of these speed anomalies!
Thanks all for the collaboration - Alex
Any Py-based timing tests, that match a C++-based test like Melody just did, would be amazing to add to the Py tests dir, and would help resolve some of these speed anomalies!
Agreed. I'll make an issue.
Hi all, following it the python code that I wrote (adapting from the example2d1many.py
) to test the performance. We could probably reuse part of it for the python timing tests.
I choose the problem size as in the 3D single coil case and borrow the timing function from @paquiteau 's code.
import numpy as np
from time import perf_counter
import pycuda.autoinit
from pycuda.gpuarray import GPUArray, to_gpu
from cufinufft import cufinufft
def timed(fun):
g_start=perf_counter()
stop = g_start
t=[]
while stop-g_start < 10:
start=perf_counter()
res = fun()
stop =perf_counter()
t.append(stop-start)
if len(t) > 1:
t.pop(0)
return t , res
N1, N2, N3 = 128, 128, 160 # Size of uniform grid
M = 6136781 # Number of nonuniform points
n_transf = 1 # Number of input arrays
eps = 1e-3 # Requested tolerance
dtype = np.float32 # Datatype (real)
complex_dtype = np.complex64 # Datatype (complex)
kx = np.random.uniform(-np.pi, np.pi, size=M)
ky = np.random.uniform(-np.pi, np.pi, size=M)
kz = np.random.uniform(-np.pi, np.pi, size=M)
c = (np.random.standard_normal((n_transf, M)) + 1j * np.random.standard_normal((n_transf, M)))
kx = kx.astype(dtype)
ky = ky.astype(dtype)
kz = kz.astype(dtype)
c = c.astype(complex_dtype)
fk_gpu = GPUArray((n_transf, N1, N2, N3), dtype=complex_dtype)
##### Type 1 #######
type1plan = cufinufft(1, (N1, N2, N3), n_transf, eps=eps, dtype=dtype)
type1plan.set_pts(to_gpu(kx), to_gpu(ky), to_gpu(kz))
type1plan.execute(to_gpu(c), fk_gpu)
t, r = timed(lambda: type1plan.execute(to_gpu(c), fk_gpu))
print(f'3d1: {np.mean(t)*1000:9.2f}ms ± {np.std(t)*1000:3.3f}')
##### Type 2 #######
fk = fk_gpu.get()
c_gpu = GPUArray((n_transf, M), dtype=complex_dtype)
type2plan = cufinufft(2, (N1, N2, N3), n_transf, eps=eps, dtype=dtype)
type2plan.set_pts(to_gpu(kx), to_gpu(ky), to_gpu(kz))
type2plan.execute(c_gpu, to_gpu(fk))
t, r = timed(lambda: type2plan.execute(c_gpu, to_gpu(fk)))
print(f'3d2: {np.mean(t)*1000:9.2f}ms ± {np.std(t)*1000:3.3f}')
Here are the timing results:
[yhs264@gv001 examples]$ python3 example3d.py
3d1: 27.56ms ± 0.381
3d2: 18.49ms ± 0.282
[yhs264@gv001 examples]$ python3 example3d.py
3d1: 66.18ms ± 1.431
3d2: 170.12ms ± 1.282
Two differences I could think of are
make python
rather than using the precompiled distribution (I'm using CUDA 11.3.1). I am not sure what else can cause the difference. Hopefully the results here could help and guide us to find the reason.
Best regards, Melody
FYI, with #130, the timing becomes:
[yhs264@gv001 examples]$ python3 example3d.py
3d1: 27.56ms ± 0.699
3d2: 16.56ms ± 0.318
[yhs264@gv001 examples]$ python3 example3d.py
3d1: 66.00ms ± 0.338
3d2: 33.30ms ± 2.276
Hi @MelodyShih , this looks very promising. could you try with the same distribution of point as ours ? Typical MRI reconstruction have an higher density of point in the center of kspace. And its the case with the provided dataset (https://perso.crans.org/comby/dataset_pysap.npy, also available in the colab notebook)
You could make a concentrated NU pts distribution just by taking the sqrt of (the magnitude of) each coordinate. Or, only 1 or 2 of the coords if want less concentration.
Hi @paquiteau , I tried running with your dataset. Here is the timing I got on V100:
3d2 (op): 10.25ms ± 0.505
3d1 (adjop): 36.51ms ± 0.426
Thank you @MelodyShih , so the performance lost of cufinufft would be due to not hardware-dedicated compilation ? for gpuNUFFT I used cuda integration in CMake to automatically compile for the user's architechture (see here).
Thank you @MelodyShih , we are also re-compiling using CUDA11 at our end to make sure that the numbers match. I was wondering if the issue is just because of underperforming binaries at pip? Which CUDA was used to create the binaries at pip currently?
I would recommend either creating binaries specific to CUDA like cupy (cupy-cuda100, cupy-cuda110 etc) or have a way in which the binaries can be compiled on the fly like how it is done in gpuNUFFT as pointed by @paquiteau
Personally, I feel given the compile time is large for cufinufft, having separate per CUDA binary (or at least the latest CUDA binary) will be helpful perhaps? I can help if needed with a PR to generate this / have a CD to generate this.
Which CUDA was used to create the binaries at pip currently?
We are currently using CUDA 10.2, IIRC.
I would recommend either creating binaries specific to CUDA like cupy (cupy-cuda100, cupy-cuda110 etc) or have a way in which the binaries can be compiled on the fly like how it is done in gpuNUFFT as pointed by @paquiteau
I think this makes sense. We did consider it, but the increased complexity (and consequent maintenance responsibility) led us to the fixed CUDA version. That being said, if there is a considerable speedup here, it may be worth looking into. I should say, however, that we are planning on moving to CUDA 11 exclusively relatively soon, so this would be mainly to support other CUDA versions.
Hello again, I have run some profiling tools to see what could be wrong with our setup. I manually compiled cufiNUFFT for the GPU of my machine (Quadro M4000, compute capability 5.2, cuda 11.0 ). And run a script using Nvidia Visual Profiler. I am using the same setup as before (3D radial MRI dataset, 32 coils ).
Here is the annotated screen capture of the profiling:
The raw profiling file is there: gpunufft_vs_cufi3d.zip
The two main bottleneck for cufiNUFFT are the memory transfert (they have a low troughput, and are done synchronously), and the kernel Spread_3d_Subproblem
(which takes 500ms per transform, compares to the 10 ms required for the other kernels of cufinufft).
At this point this might be a gpu issue, is cufiNUFFT codebase optimized for specific compute capabilities ?
You can also see that gpuNUFFT is using asynchronous copy for the coil data, ie doing the copy of coil N while copying the data for coil N+1. (that was the work of @chaithyagr). We also have idea to make the computation faster in this multicoil setup especially in the case of data consistency operation e.g. gradient of ||Fx-y||_2^2 .
for the sake of this benchmark , I added basic support of cufinufft for pysap-mri, our main software to make research on mri reconstruction: https://github.com/paquiteau/pysap-mri/blob/pygpuNUFFT/mri/operators/fourier/non_cartesian.py
Hi paquiteau, Thanks for sharing the profiling results. To answer the question about what compute capabilities we optimized for, we focus on the Volta micro-architecture and hence compute capabilities 7.0 and above.
I have now access to an RTX 5000 (Turing Architecture compute capability 7.5). So I ran the bench mark again, and this time I am using nvidia nsight system (nvvp is deprecated/ill supported for recent GPUs). I also limited myself to only use single coil data (cufinufft does not use concurent computation on coils dimension anyway)
The problem still persist, even if we compare in term of raw kernel performance
This is gpuNUFFT timeline:
This is cufinufft timeline:
Global view of the benchmark (first initialisation, then computation of gpuNUFFT, then cufinufft init (not visible, its fast) and computation)
Note that the time axis for the two first capture is the same scale, but due to github import the scale of gpuNUFFT is zoomed in.
The raw profiling file is available here: https://perso.crans.org/comby/Report%204.nsys-rep
and the script use is there: https://github.com/paquiteau/notebooks/blob/master/mri_cufi.py
hmm I'm not sure what causes the difference comparing to the timing on V100. How many stream multiprocessors does RTX5000 have? Btw, are you using the latest codes? There is a "bug" fixed that should help with the timing for type 2 transforms (not for type 1 though).
Here is the deviceQuery for the RTX 5000:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro RTX 5000"
CUDA Driver Version / Runtime Version 11.4 / 11.2
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 16122 MBytes (16905338880 bytes)
(48) Multiprocessors, ( 64) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 1815 MHz (1.81 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 65536 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 213 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.2, NumDevs = 1
Result = PASS
Yes, I saw you merged PR and used an up-to-date version of cufinufft.
Hello again, In the past weeks, I have been focusing on creating a wrapper around cufinufft dedicated to MRI reconstruction, in python, using cupy: https://github.com/paquiteau/mri-cufinufft. It add basic density compensation and Sensitivity maps supports.
The conclusion to this story is for me the following:
gpuNUFFT is still buggy and unmaintained, cufinufft is easily extensible, and provide great customization. So despite the performance loss, I will switch to use cufinufft, and you may see a PR someday to improve the performance.
Below you will find a detailled benchmark I did using an RTX 5000.
In the following charts the labels reads as follow:
And as always, op=type 2 , adj_op=type 1
2D Benchmark, 512x512, Single Channel, Radial Trajectory
2D Benchmark, 512x512, 32 channels, Radial Trajectory
3D Benchmark, 128x128x160, Single channels, Radial Trajectory
3D Benchmark, 128x128x160, 32 channels, Radial Trajectory
Hi, with @chaithyagr we continued development of gpuNUFFT (https://github.com/paquiteau/pygpuNUFFT and https://github.com/chaithyagr/gpuNUFFT) library.
We run some benchmark test and we found some performance issue on CufiNUFFt, contradicting the results shown on Figure 4 in https://arxiv.org/abs/2102.08463
In particular, Forward (Type II NUFFT, image to fourier domain) and Adjoint (Type I NUFFT, fourier domain to image) cufiNUFFT operation are slower than gpuNUFFT in 3D both for single channel and multichannel data.
We use a samples from the PySAP dataset, with the same order of magnitude of resolution size and number of sampling point in fourier space.
For 32 coil data: (Image is 32x128x128x160)
for single channel data (Image is 128x128x160)
The difference of speed up between the single and multichannel example is due to some concurrency usage in gpuNUFFT. Moreover, we considered only Computation and D2H copy times for cufiNUFFT, whereas gpuNUFFT times also includes H2D copy.
Nevertheless, cufiNUFFT remains faster for 2D data.
A minimal reproducible example is available on google collab: