paquiteau commented 2 years ago

Hi, with @chaithyagr we continued development of gpuNUFFT (https://github.com/paquiteau/pygpuNUFFT and https://github.com/chaithyagr/gpuNUFFT) library.

We run some benchmark test and we found some performance issue on CufiNUFFt, contradicting the results shown on Figure 4 in https://arxiv.org/abs/2102.08463

In particular, Forward (Type II NUFFT, image to fourier domain) and Adjoint (Type I NUFFT, fourier domain to image) cufiNUFFT operation are slower than gpuNUFFT in 3D both for single channel and multichannel data.

We use a samples from the PySAP dataset, with the same order of magnitude of resolution size and number of sampling point in fourier space.

For 32 coil data: (Image is 32x128x128x160)

gpu3d_op     :   1582.74ms  ± 3.853 (x5.65 speedup)
cufi3d_op    :   8952.54ms  ± 2.771 

gpu3d_adjop  :   1739.30ms  ± 10.767 (x2.39 speedup)
cufi3d_adjop :   4159.96ms  ± 5.939

for single channel data (Image is 128x128x160)

gpu3d_op     :     73.42ms  ± 1.119 ( x3.8 speedup)
cufi3d_op    :    279.36ms  ± 0.552 

gpu3d_adjop  :     62.73ms  ± 0.640 ( x2.1 speedup)
cufi3d_adjop :    132.27ms  ± 0.697

The difference of speed up between the single and multichannel example is due to some concurrency usage in gpuNUFFT. Moreover, we considered only Computation and D2H copy times for cufiNUFFT, whereas gpuNUFFT times also includes H2D copy.

Nevertheless, cufiNUFFT remains faster for 2D data.

A minimal reproducible example is available on google collab:

https://colab.research.google.com/drive/1IuOgwshEG04GVaZoinZNdbRkPmG-Kadp?usp=sharing

ahbarnett commented 2 years ago

Dear Pierre-Antoine,

Thanks for the reproducible test code. I have not had time to run it. It's surprising. We do stand behind our timings. What GPU do you use? Let's take the 3d 1-coil, where you have 6.1M NU pts, and the fwd transform (type 2): based on your timings,

gpunufft seems to get about 12ns / NUpt (which is roughly consistent with our finding of 18ns / NU pt in bottom-right of Fig. 4).
cufinufft you claim about 50 ns / NU pt. However we know that our throughput was 4-5 ns / NU pt at eps=1e-3. So there is a factor of 10 missing here!

So, rather than us mis-measuring gpunufft's speed, this seems to be a performance issue with your calling cufinufft that is much worse than expected. Could it be our Python wrapper?

I will see if @MelodyShih or @JBlaschke have any ideas.

Best, Alex

paquiteau commented 2 years ago

Hi,

It's surprising. We do stand behind our timings. What GPU do you use?

We could reproduces those results on an GTX Quadro M4000, Quadro P400 and The V100 of Google Colab. So the results does not seem to be GPU dependent.

To benchmark the cufiNUFFT with python I based my implementation on the examples provided such as examples/example2d2many.py. The cufiNUFFT call are very close to machine code I would say whereas the gpuNUFFT is called through the wrapper NonCartesianFFT Class of pysap-mri.

Moreover, the timing for gpuNUFFT includes H2D, Kernels and D2H. for cufiNUFFT its only H2D and Kernels.

The fact that 2D is faster seems to indicate that the memory transferts are detrimental. Using paged-lock memory may give greater speed.

chaithyagr commented 2 years ago

I think, at this moment, it makes sense to get NVVP profiles for each version to get better clarity. @ahbarnett , is there a version of code we could use for testing which doesn't have pythonic wrapper?

From what I remember, using python wrapper can slow things down in 3D if you are creating memories for it and copying stuff over (D2H ie), but then, I don't think cufiNUFFT does it.

MelodyShih commented 2 years ago

Hi all, I benchmark with the C++ codes directly so I won't rule out yet the python wrapper could be the reason of the timing mismatch here. Also, note that in figure 4 of our paper, we have #NUpts (sampling pts) being 10 times the #U pts, instead of them being in the same order ,so the test case here is slightly different.

Here are the two timing results on a V100 GPU I got this morning. The NUpts are iid uniform random variables across the box.

[yhs264@gv001 cufinufft]$ bin/cufinufft3d1_test_32 2 128 128 128 2097152 1e-3
[time  ] dummy warmup call to CUFFT  0.293 s
[time  ] cufinufft plan:         0.00117 s
[time  ] cufinufft setNUpts:         0.000957 s
[time  ] cufinufft exec:         0.0201 s
[time  ] cufinufft destroy:      0.000852 s
[time  ] cufinufft memcpy:       0.0013 s <--- D2H
[Method 2] 2097152 NU pts to 2097152 U pts in 0.0244 s: 8.6e+07 NU pts/s
                    (exec-only thoughput: 1.04e+08 NU pts/s)

---> this gives about 11 ns/ nupts, basically same as the speed of gpunufft. I include the D2H timing here. Note that the memory copy time is only 6% of the execution time, it won't affect the speed too much if we further include the H2D timing.

NUpts = 8x #Upts

[yhs264@gv001 cufinufft]$ bin/cufinufft3d1_test_32 2 128 128 128 16777216 1e-3
[time  ] dummy warmup call to CUFFT  0.282 s
[time  ] cufinufft plan:         0.00118 s
[time  ] cufinufft setNUpts:         0.00608 s
[time  ] cufinufft exec:         0.0412 s
[time  ] cufinufft destroy:      0.00096 s
[time  ] cufinufft memcpy:       0.00129 s
[Method 2] 16777216 NU pts to 2097152 U pts in 0.0507 s:    3.31e+08 NU pts/s
                (exec-only thoughput: 4.07e+08 NU pts/s)

---> this gives about 3 ns/ nupts, which is same as what we reported in Figure 4.

Chaithya, you could find a test code here (type 1, adjoint) and here (type2, forward).

Best regards, Melody

chaithyagr commented 2 years ago

Thank you so much Melody... We will test this and maybe fix the python wrapper if that's slowing us down while using cufinufft.

JBlaschke commented 2 years ago

Hi Folks -- just catching up on this. If the python wrapper is slowing things down, then I would really love to know why. The python wrappers use ctypes, so my instinct would be to assume very low overhead (or at least fixed overhead -- as we're passing pointers around, so at worst case we're copying a couple more numbers than we would in pure C code.)

@MelodyShih @ahbarnett Do the tests here https://github.com/flatironinstitute/cufinufft/blob/master/python/cufinufft/tests/ already follow https://github.com/flatironinstitute/cufinufft/tree/master/test ? If not I can write the python tests that match the pure C ones and profile them.

garrettwrong commented 2 years ago

I'm sure this was already ruled out, but it might be worth double checking... If the data coming from the python code is not of the correct (complex) dtype, order, and contiguous there could be a deep copy (or worse)... We run into this in ASPIRE where we have float data, for example.

JBlaschke commented 2 years ago

Agreed @garrettwrong -- another issue could be device management / memcpy https://colab.research.google.com/drive/1IuOgwshEG04GVaZoinZNdbRkPmG-Kadp?usp=sharing mostly omits this from the timed loop. (@paquiteau in general I recommend changing your timed so that it doesn't time the first call).

One thing I notices in @paquiteau 's profiling code is that the gpuNFFT is timed using:

t,r = timed(lambda: gpuNUFFT.op(img3d))

whereas the cufinufft code is profiled using:

t,r = timed(lambda: plan.execute(kspace3d_gpu, to_gpu(img3d)))

which includes the to_gpu HtoD memcopy.

@garrettwrong ASPIRE == https://computationalcryoem.github.io/ASPIRE-Python/index.html ?

chaithyagr commented 2 years ago

I think gpuNUFFT carries out H2D copies (and also D2H) internally. So we include them all.

JBlaschke commented 2 years ago

@chaithyagr does gpuNUFFT always copy, or does it detect if the data is already on the GPU (Your timed loop runs 10 iterations, so I wonder if we're comparing [HtoD, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT, gpuNUFFT] with [HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft, HtoD, cufinufft])

paquiteau commented 2 years ago

Hi everyone (its very nice to see such will power to solve this issue)

I think gpuNUFFT carries out H2D copies (and also D2H) internally. So we include them all.

Its even worse than that: The data is first copied to pinned memory , then transfered on the gpu:

https://github.com/paquiteau/pygpuNUFFT/blob/master/python/src/python_factory.cpp#L101

and then the copy is done coil wise here:

https://github.com/paquiteau/pygpuNUFFT/blob/master/CUDA/src/gpuNUFFT_operator.cpp#L1178

(@paquiteau in general I recommend changing your timed so that it doesn't time the first call).

@JBlaschke : I have updated the notebook with your suggestion, the results remains the same.

@chaithyagr does gpuNUFFT always copy, or does it detect if the data is already on the GPU

yes, it always copy. (sadly)

JBlaschke commented 2 years ago

Right! I know this is the standard path for GPU code development, so I am not judging. Also @MelodyShih 's benchmark figures include the memcopies.

ahbarnett commented 2 years ago

Hi everyone. THanks Melody for your C++ test verifications. It's weird that the high-density case is a factor nearly 4 different per-NU-point.

Johannes:

@MelodyShih https://github.com/MelodyShih @ahbarnett https://github.com/ahbarnett Do the tests here https://github.com/flatironinstitute/cufinufft/blob/master/python/cufinufft/tests/ already follow https://github.com/flatironinstitute/cufinufft/tree/master/test ? If not I can write the python tests that match the pure C ones and profile them.

No, the Py demos are merely for correctness and were done by Joakim for our paper submission 1 year ago. Any Py-based timing tests, that match a C++-based test like Melody just did, would be amazing to add to the Py tests dir, and would help resolve some of these speed anomalies!

Thanks all for the collaboration - Alex

janden commented 2 years ago

Any Py-based timing tests, that match a C++-based test like Melody just did, would be amazing to add to the Py tests dir, and would help resolve some of these speed anomalies!

Agreed. I'll make an issue.

MelodyShih commented 2 years ago

Hi all, following it the python code that I wrote (adapting from the example2d1many.py) to test the performance. We could probably reuse part of it for the python timing tests. I choose the problem size as in the 3D single coil case and borrow the timing function from @paquiteau 's code.

import numpy as np
from time import perf_counter

import pycuda.autoinit
from pycuda.gpuarray import GPUArray, to_gpu

from cufinufft import cufinufft

def timed(fun):
    g_start=perf_counter()
    stop = g_start
    t=[]
    while stop-g_start < 10:
        start=perf_counter()
        res = fun()
        stop =perf_counter()
        t.append(stop-start)
    if len(t) > 1:
        t.pop(0)
    return t , res

N1, N2, N3 = 128, 128, 160      # Size of uniform grid
M = 6136781                     # Number of nonuniform points
n_transf = 1                    # Number of input arrays
eps = 1e-3                      # Requested tolerance
dtype = np.float32              # Datatype (real)
complex_dtype = np.complex64    # Datatype (complex)

kx = np.random.uniform(-np.pi, np.pi, size=M)
ky = np.random.uniform(-np.pi, np.pi, size=M)
kz = np.random.uniform(-np.pi, np.pi, size=M)
c = (np.random.standard_normal((n_transf, M)) + 1j * np.random.standard_normal((n_transf, M)))

kx = kx.astype(dtype)
ky = ky.astype(dtype)
kz = kz.astype(dtype)
c  = c.astype(complex_dtype)

fk_gpu = GPUArray((n_transf, N1, N2, N3), dtype=complex_dtype)

##### Type 1 #######
type1plan = cufinufft(1, (N1, N2, N3), n_transf, eps=eps, dtype=dtype)
type1plan.set_pts(to_gpu(kx), to_gpu(ky), to_gpu(kz))
type1plan.execute(to_gpu(c), fk_gpu)
t, r = timed(lambda: type1plan.execute(to_gpu(c), fk_gpu))
print(f'3d1: {np.mean(t)*1000:9.2f}ms  ± {np.std(t)*1000:3.3f}')

##### Type 2 #######
fk = fk_gpu.get()
c_gpu = GPUArray((n_transf, M), dtype=complex_dtype)

type2plan = cufinufft(2, (N1, N2, N3), n_transf, eps=eps, dtype=dtype)
type2plan.set_pts(to_gpu(kx), to_gpu(ky), to_gpu(kz))
type2plan.execute(c_gpu, to_gpu(fk))
t, r = timed(lambda: type2plan.execute(c_gpu, to_gpu(fk)))
print(f'3d2: {np.mean(t)*1000:9.2f}ms  ± {np.std(t)*1000:3.3f}')

Here are the timing results:

On V100:

[yhs264@gv001 examples]$ python3 example3d.py
3d1:    27.56ms  ± 0.381
3d2:    18.49ms  ± 0.282

On RTX8000:

[yhs264@gv001 examples]$ python3 example3d.py
3d1:     66.18ms  ± 1.431
3d2:    170.12ms  ± 1.282

Two differences I could think of are

I compiled cufinufft and install the python wrapper via make python rather than using the precompiled distribution (I'm using CUDA 11.3.1).
The NUpts' locations affect the performance and we are not using the same ones.

I am not sure what else can cause the difference. Hopefully the results here could help and guide us to find the reason.

Best regards, Melody

MelodyShih commented 2 years ago

FYI, with #130, the timing becomes:

On V100:

[yhs264@gv001 examples]$ python3 example3d.py
3d1:     27.56ms  ± 0.699
3d2:     16.56ms  ± 0.318

On RTX8000:

[yhs264@gv001 examples]$ python3 example3d.py
3d1:     66.00ms  ± 0.338
3d2:     33.30ms  ± 2.276

paquiteau commented 2 years ago

Hi @MelodyShih , this looks very promising. could you try with the same distribution of point as ours ? Typical MRI reconstruction have an higher density of point in the center of kspace. And its the case with the provided dataset (https://perso.crans.org/comby/dataset_pysap.npy, also available in the colab notebook)

ahbarnett commented 2 years ago

You could make a concentrated NU pts distribution just by taking the sqrt of (the magnitude of) each coordinate. Or, only 1 or 2 of the coords if want less concentration.

MelodyShih commented 2 years ago

Hi @paquiteau , I tried running with your dataset. Here is the timing I got on V100:

3d2 (op):        10.25ms  ± 0.505
3d1 (adjop):     36.51ms  ± 0.426

paquiteau commented 2 years ago

Thank you @MelodyShih , so the performance lost of cufinufft would be due to not hardware-dedicated compilation ? for gpuNUFFT I used cuda integration in CMake to automatically compile for the user's architechture (see here).

chaithyagr commented 2 years ago

Thank you @MelodyShih , we are also re-compiling using CUDA11 at our end to make sure that the numbers match. I was wondering if the issue is just because of underperforming binaries at pip? Which CUDA was used to create the binaries at pip currently?

I would recommend either creating binaries specific to CUDA like cupy (cupy-cuda100, cupy-cuda110 etc) or have a way in which the binaries can be compiled on the fly like how it is done in gpuNUFFT as pointed by @paquiteau

Personally, I feel given the compile time is large for cufinufft, having separate per CUDA binary (or at least the latest CUDA binary) will be helpful perhaps? I can help if needed with a PR to generate this / have a CD to generate this.

janden commented 2 years ago

Which CUDA was used to create the binaries at pip currently?

We are currently using CUDA 10.2, IIRC.

I would recommend either creating binaries specific to CUDA like cupy (cupy-cuda100, cupy-cuda110 etc) or have a way in which the binaries can be compiled on the fly like how it is done in gpuNUFFT as pointed by @paquiteau

I think this makes sense. We did consider it, but the increased complexity (and consequent maintenance responsibility) led us to the fixed CUDA version. That being said, if there is a considerable speedup here, it may be worth looking into. I should say, however, that we are planning on moving to CUDA 11 exclusively relatively soon, so this would be mainly to support other CUDA versions.

paquiteau commented 2 years ago

Hello again, I have run some profiling tools to see what could be wrong with our setup. I manually compiled cufiNUFFT for the GPU of my machine (Quadro M4000, compute capability 5.2, cuda 11.0 ). And run a script using Nvidia Visual Profiler. I am using the same setup as before (3D radial MRI dataset, 32 coils ).

Here is the annotated screen capture of the profiling: cufigpu The raw profiling file is there: gpunufft_vs_cufi3d.zip

The two main bottleneck for cufiNUFFT are the memory transfert (they have a low troughput, and are done synchronously), and the kernel Spread_3d_Subproblem (which takes 500ms per transform, compares to the 10 ms required for the other kernels of cufinufft).

At this point this might be a gpu issue, is cufiNUFFT codebase optimized for specific compute capabilities ?

You can also see that gpuNUFFT is using asynchronous copy for the coil data, ie doing the copy of coil N while copying the data for coil N+1. (that was the work of @chaithyagr). We also have idea to make the computation faster in this multicoil setup especially in the case of data consistency operation e.g. gradient of ||Fx-y||_2^2 .

for the sake of this benchmark , I added basic support of cufinufft for pysap-mri, our main software to make research on mri reconstruction: https://github.com/paquiteau/pysap-mri/blob/pygpuNUFFT/mri/operators/fourier/non_cartesian.py

MelodyShih commented 2 years ago

Hi paquiteau, Thanks for sharing the profiling results. To answer the question about what compute capabilities we optimized for, we focus on the Volta micro-architecture and hence compute capabilities 7.0 and above.

paquiteau commented 2 years ago

I have now access to an RTX 5000 (Turing Architecture compute capability 7.5). So I ran the bench mark again, and this time I am using nvidia nsight system (nvvp is deprecated/ill supported for recent GPUs). I also limited myself to only use single coil data (cufinufft does not use concurent computation on coils dimension anyway)

The problem still persist, even if we compare in term of raw kernel performance

This is gpuNUFFT timeline: Screenshot from 2022-02-24 17-41-57

This is cufinufft timeline: Screenshot from 2022-02-24 17-41-13

Global view of the benchmark (first initialisation, then computation of gpuNUFFT, then cufinufft init (not visible, its fast) and computation) Screenshot from 2022-02-24 17-43-55

Note that the time axis for the two first capture is the same scale, but due to github import the scale of gpuNUFFT is zoomed in.

The raw profiling file is available here: https://perso.crans.org/comby/Report%204.nsys-rep

and the script use is there: https://github.com/paquiteau/notebooks/blob/master/mri_cufi.py

MelodyShih commented 2 years ago

hmm I'm not sure what causes the difference comparing to the timing on V100. How many stream multiprocessors does RTX5000 have? Btw, are you using the latest codes? There is a "bug" fixed that should help with the timing for type 2 transforms (not for type 1 though).

paquiteau commented 2 years ago

Here is the deviceQuery for the RTX 5000:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro RTX 5000"
  CUDA Driver Version / Runtime Version          11.4 / 11.2
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 16122 MBytes (16905338880 bytes)
  (48) Multiprocessors, ( 64) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1815 MHz (1.81 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 213 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.2, NumDevs = 1
Result = PASS

Yes, I saw you merged PR and used an up-to-date version of cufinufft.

paquiteau commented 2 years ago

Hello again, In the past weeks, I have been focusing on creating a wrapper around cufinufft dedicated to MRI reconstruction, in python, using cupy: https://github.com/paquiteau/mri-cufinufft. It add basic density compensation and Sensitivity maps supports.

The conclusion to this story is for me the following:

In raw performance, cufinufft outperforms gpuNUFFT in 2D, and equals it in 3D for high acceleration factor.
cufinufft kernels have low occupancy factor (only 50% of the available warps are used basically)
In multicoil settings, gpuNUFFT takes usage of concurrency (using async copy and runs with streams), which makes it a real challenger for cufinufft.
In an iterative convex optimization solver (eg FISTA, POGM) the quality of the reconstruction are identical, In an unregularized setup, cufinufft as a small lead (due, I think, to the use of ES kernel instead of KB for cufinufft)

gpuNUFFT is still buggy and unmaintained, cufinufft is easily extensible, and provide great customization. So despite the performance loss, I will switch to use cufinufft, and you may see a PR someday to improve the performance.

Below you will find a detailled benchmark I did using an RTX 5000.

In the following charts the labels reads as follow:

eps : the precision parameter for cufinufft
gpu-single: data is on device, plan is created/setup/used/deleted for each transform run
gpu-persist: data is on device, plan is initialized and reused later.
cpu-single: data is on host, plan is created/setup/used/deleted for each transform run
cpu-persist: data is on host, plan is initialized and reused later. For the Smaps I tested the following setup:
Calibless: No sensitivity maps
Smaps: Smaps are dynamically loaded on the device (this save memory)
Smaps Cached: A copy of the smaps array is kept on the gpu at all time

And as always, op=type 2 , adj_op=type 1

2D Benchmark, 512x512, Single Channel, Radial Trajectory

2D Benchmark, 512x512, 32 channels, Radial Trajectory

3D Benchmark, 128x128x160, Single channels, Radial Trajectory

3D Benchmark, 128x128x160, 32 channels, Radial Trajectory

flatironinstitute / cufinufft

gpuNUFFT is faster than cufiNUFFT in 3D #126

NUpts = 8x #Upts