DTolm / VkFFT

Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library
MIT License

Titan V FP32 & FP64 results.. #10

Closed · oscarbg closed this issue 3 years ago

oscarbg commented 3 years ago

Hi, I just tested your awesome library on a Titan V (Volta), which has FP64 only 2x slower than FP32, using NVIDIA driver 455.26.01 and CUDA 10.1 (I can update the CUDA SDK to 11.1 if requested, in case cuFFT is faster there). Titan V results:

VkFFT single: titanv.txt
cuFFT single: titanvcufft.txt
VkFFT double: titanvfp64.txt
cuFFT double: titanvcufftdouble.txt

Roughly speaking, your library is well optimized for double precision: it slows down only 2x relative to single and scores similarly to cuFFT.

There are some performance issues in VkFFT in double precision versus single, compared to cuFFT, for example in the 4096x4096x8 case:

VkFFT System: 4096x4096x8 Buffer: 2048 MB avg_time_per_step: 75.931 ms
cuFFT System: 4096x4096x8 Buffer: 2048 MB avg_time_per_step: 58.844 ms

Note that similarly "big" double-precision FFTs are not affected:

VkFFT System: 2048x256x256 Buffer: 2048 MB avg_time_per_step: 48.205 ms std_error: 0.329 batch: 1
cuFFT System: 2048x256x256 Buffer: 2048 MB avg_time_per_step: 45.925 ms std_error: 0.215 batch: 1

As you can see from the FP32 case:

cuFFT System: 4096x4096x8 Buffer: 1024 MB avg_time_per_step: 29.235 ms std_error: 0.179 batch: 4
VkFFT System: 4096x4096x8 Buffer: 1024 MB avg_time_per_step: 29.923 ms std_error: 0.613 batch: 4

4096x4096x8 performed similarly to cuFFT in the FP32 case (~29 ms), and cuFFT scales well from 29 ms to 58 ms when going to double.

As said, 2048x256x256 scales well to FP64: it goes from 23-24 ms in FP32 to 46-48 ms in FP64:

VkFFT System: 2048x256x256 Buffer: 1024 MB avg_time_per_step: 24.031 ms std_error: 0.662 batch: 4 benchmark: 43633
cuFFT System: 2048x256x256 Buffer: 1024 MB avg_time_per_step: 23.424 ms std_error: 0.085 batch: 4 benchmark: 44765

Plots: [EDIT: in next post]

oscarbg commented 3 years ago

Attaching plots:
Single: benchmarktitansingle
Double: benchmarktitandouble

Hoping to test and provide FP16 results as soon as you publish support for it. :-)

oscarbg commented 3 years ago

Now I see the new double precision commit:

"I have an unreleased version which uses polynomial expansion (degree=20) to calculate sincos on the GPU; it is 3 times slower than the LUT on consumer-level GPUs."

Are you willing to upload or share this variant, so I can test whether it is faster on the Titan V and can beat cuFFT in some cases?

Thanks!

DTolm commented 3 years ago

Hello, big thanks for this report, it certainly shows what I need to investigate! I can explain the 4kx4kx8 discrepancy now; it will be alleviated in one of the next updates. Right now, in double precision the switch between the single-pass FFT and the four-step FFT happens at a 2k sequence length in the x direction and at a 1k length in y and z. The 4k x 4k x 1, 4k x 4k x 8, and 2k x 2k x 1 systems are the boundary cases where this switch plays a major role (2k x 256 x 256 is fine in this regard). Applying register overutilization to these sizes will bring the results to the same level (the idea is similar to the 8k and 16k folders in single precision).
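To make the boundary concrete, here is a minimal host-side sketch of the switching rule described above, using the double-precision thresholds from this comment; the enum and function names are hypothetical, not VkFFT's actual internals:

```cuda
// Minimal sketch (illustrative names, not the VkFFT API) of the
// single-pass vs. four-step dispatch described above. In double
// precision, sequences up to 2048 along x (1024 along y/z) fit in a
// single pass; longer ones fall back to the four-step decomposition.
#include <cstdint>

enum class FFTStrategy { SinglePass, FourStep };

FFTStrategy chooseStrategyFP64(uint64_t length, int axis) {
    const uint64_t limit = (axis == 0) ? 2048 : 1024; // x vs. y/z threshold
    return (length <= limit) ? FFTStrategy::SinglePass : FFTStrategy::FourStep;
}
// 4096x4096x8 exceeds both thresholds, so in double precision it takes
// the four-step path, which is where the observed discrepancy comes from.
```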

I attach shaders that compute sincos instead of using the LUT. To use them, simply replace the double folder in the shaders directory with the one from the archive: double.zip

oscarbg commented 3 years ago

Hi, thanks for the detailed explanation of what's going on and for sharing the sincos shaders. I attach the results using them:

titanvfp64sincos.txt

The overall score is the same, but some cases see big gains (rivalling or beating cuFFT):

VkFFT System: 1024x1024x64 Buffer: 1024 MB avg_time_per_step: 31.005 ms
VkFFT System: 1048576x64x1 Buffer: 1024 MB avg_time_per_step: 26.909 ms

vs. the older LUT version:

VkFFT System: 1024x1024x64 Buffer: 1024 MB avg_time_per_step: 35.660 ms
VkFFT System: 1048576x64x1 Buffer: 1024 MB avg_time_per_step: 34.933 ms

DTolm commented 3 years ago

This speedup on bigger systems (and slowdown on smaller ones, like 32x32 and 64x64) mostly indicates that smaller systems are compute-bound (we can afford to spend bandwidth on LUT uploads), while bigger systems are bandwidth-bound (similar to float, it is cheaper to just recalculate sincos on-chip than to spend time on the LUT upload, as the LUT size increases with the FFT dimension). Note that I have not spent much time optimizing the LUT layout and the way it is uploaded (it is done as a simple storage buffer). I guess a mixture of both algorithms will produce the best results, though this will only work on devices suited for double precision. Consumer-level GPUs are limited to the LUT due to their low double-precision core count.
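For illustration, here is a hedged CUDA sketch of the two twiddle-factor paths being compared; this is stand-in device code, not VkFFT's generated shaders (the unreleased variant uses a degree-20 polynomial rather than the CUDA math library):

```cuda
#include <cuda_runtime.h>

// Two ways to obtain the double-precision twiddle factor
// w = exp(-2*pi*i*k/N). Illustrative stand-in code, not VkFFT shaders.

// Bandwidth-bound path: one global-memory load per twiddle factor.
// The LUT grows with the FFT dimension, so large transforms pay more
// in memory traffic.
__device__ double2 twiddleLUT(const double2* lut, unsigned k) {
    return lut[k];
}

// Compute-bound path: recalculate sin/cos on-chip with no memory
// traffic, but this needs the FP64 throughput that consumer GPUs lack,
// which is why they are limited to the LUT path.
__device__ double2 twiddleCompute(unsigned k, unsigned N) {
    const double angle = -2.0 * 3.14159265358979323846 * (double)k / (double)N;
    double2 w;
    sincos(angle, &w.y, &w.x); // w.x = cos(angle), w.y = sin(angle)
    return w;
}
```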

oscarbg commented 3 years ago

Yep, an interesting experiment anyway. I will keep the thread open to submit FP16 benchmarks here once the code is ready.

oscarbg commented 3 years ago

Benchmarked the new FP16 support on the Titan V. This time I compiled on Windows, so I could overclock a little too. :-)

Titan V Windows: fp16.txt

Overclocked results: fp16oc.txt

One of the greatest FP16 vs FP32 speedups (almost 3x faster: 18.566 ms / 6.411 ms ≈ 2.9):

FP16: VkFFT System: 1024x1024x64 Buffer: 256 MB avg_time_per_step: 6.411 ms std_error: 0.019 batch: 16 benchmark: 81781
FP32: VkFFT System: 1024x1024x64 Buffer: 512 MB avg_time_per_step: 18.566 ms std_error: 0.243 batch: 8 benchmark: 28239

great work!

closing now!

DTolm commented 3 years ago

It is really good to hear that FP16 achieves the theoretical 2x speedup on this system. It would also be interesting to see how cuFFT performs in this environment, because I am truly unsure what happens with it on consumer-level GPUs (it scales unpredictably). The precision I got with cuFFT in half was also very poor. The 1024x1024x64 result was an outlier in the FP32 benchmark (a 28k vs 45k score) and I have no good explanation why; on other systems this FFT scales well.