Open hyschive opened 1 year ago
After testing and discussing with @hsinhaoHHuang, we conclude that:
- `cuFFTMp` is not supported on our GPU cards (GeForce GPUs); it requires NVIDIA data center GPUs of the Volta, Ampere, or Hopper architecture. See here.
- As an alternative, we tried another package, `GPU-FFT`, on `spock` and compared its performance with `FFTW3`; however, the performance of `GPU-FFT` is not ideal, suggesting that `FFTW3` might be sufficient.
800X800X800, single node

Setup | Memory Allocation Time (ms) | Forward FFT Time (ms) | Backward FFT Time (ms) | Answer-Copy Time (ms) |
---|---|---|---|---|
`FFTW3` (32 Ranks) | 0.0596 | 277.2239 | 286.4178 | -- |
`GPU-FFT` (1 RTX 3080Ti) | 352.9170 | 88.3235 | 56.6548 | 78.5470 |
Though the FFT computation on the GPU is fast, the `Answer-Copy Time` is expensive, making its overall performance roughly equal to that of `FFTW3`.
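To make the comparison concrete, the per-stage timings of the 800X800X800 single-node run can be summed. The snippet below is only a small arithmetic sketch, not part of the benchmark code: the dictionary keys and the `total_ms` helper are names made up for illustration, and the `--` entry for `FFTW3` is assumed to mean a copy cost of zero.

```python
# Timings in ms, copied from the 800X800X800 single-node measurements.
# Assumption: the "--" Answer-Copy entry for FFTW3 is treated as 0.0 ms.
timings_ms = {
    "FFTW3 (32 Ranks)": {
        "alloc": 0.0596, "forward": 277.2239, "backward": 286.4178, "copy": 0.0,
    },
    "GPU-FFT (1 RTX 3080Ti)": {
        "alloc": 352.9170, "forward": 88.3235, "backward": 56.6548, "copy": 78.5470,
    },
}


def total_ms(stages, include_alloc=True):
    """Sum the stage timings, optionally leaving out the memory allocation."""
    return sum(
        t for name, t in stages.items() if include_alloc or name != "alloc"
    )


for setup, stages in timings_ms.items():
    with_alloc = total_ms(stages)
    without_alloc = total_ms(stages, include_alloc=False)
    print(f"{setup}: {with_alloc:.1f} ms with allocation, "
          f"{without_alloc:.1f} ms without")
```

With allocation included, the two totals are close (about 564 ms for `FFTW3` and about 576 ms for `GPU-FFT`); without it, the `GPU-FFT` total drops to about 224 ms. Whether the allocation cost recurs on every call is not stated in this thread, so both views are shown.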
1024X1024X1024, 2 nodes

Setup | Memory Allocation Time (ms) | Forward FFT Time (ms) | Backward FFT Time (ms) | Answer-Copy Time (ms) |
---|---|---|---|---|
`FFTW3` (64 Ranks) | 0.2304 | 414.2701 | 348.4204 | -- |
`GPU-FFT` (2 RTX 3080Ti) | 385.0763 | 689.6755 | 262.1999 | 347.5528 |
Even for the FFT computation alone, the GPUs have no advantage over `FFTW3`: the forward plus backward transforms take about 952 ms with `GPU-FFT` versus about 763 ms with `FFTW3`.
Conclusion
Getting better performance for executing FFT on GPU card(s) requires suitable hardware (supporting `cuFFTMp` or giving better performance for `GPU-FFT`) and suitable code with minimal data transfer (not sure whether `GAMER` guarantees this).

@koarakawaii @hsinhaoHHuang Are you comparing single- or double-precision performance?
@hyschive: we used single precision for both cases above.
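
Since both runs used single precision, the reported Answer-Copy times can be turned into a rough effective transfer rate. The sketch below assumes that the copy moves one real-valued, single-precision array covering the whole grid exactly once; the thread does not say what is actually copied, so this is only an order-of-magnitude estimate. The helper names `grid_bytes` and `effective_gb_per_s` are made up for illustration.

```python
BYTES_PER_FLOAT32 = 4  # single precision, as stated above


def grid_bytes(n):
    """Size in bytes of one real single-precision n x n x n array."""
    return n**3 * BYTES_PER_FLOAT32


def effective_gb_per_s(n, copy_ms):
    """Effective rate in GB/s (1 GB = 1e9 bytes), assuming the copy
    moves exactly one such array in copy_ms milliseconds."""
    return grid_bytes(n) / (copy_ms * 1e-3) / 1e9


# Answer-Copy times (ms) taken from the two GPU-FFT measurements.
for n, copy_ms in [(800, 78.5470), (1024, 347.5528)]:
    print(f"{n}^3: {grid_bytes(n) / 1e9:.2f} GB, "
          f"{effective_gb_per_s(n, copy_ms):.1f} GB/s")
```

Under that assumption, the single-node copy corresponds to roughly 26 GB/s and the two-node copy to roughly 12 GB/s in aggregate, which suggests the copy cost is tied to the data volume and is hard to reduce without avoiding the transfer itself.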
`GAMER` currently supports computing FFTs only on CPUs. It would be great to support multi-GPU, multi-node FFT using `cuFFTMp`.