ParRes / Kernels

This is a set of simple programs that can be used to explore the features of a parallel platform.
https://groups.google.com/forum/#!forum/parallel-research-kernels

CUDA stencil inefficiency (compared to SYCL) #590

Open AtlantaPepsi opened 3 years ago

AtlantaPepsi commented 3 years ago

As observed here, the CUDA stencil operation appears to be much slower than DPC++ across all block sizes on NVIDIA devices. I also ran the problem (8000 grid points, 100 iterations) on a V100, and the results are as follows:

| Block size | CUDA Rate (MF/s) | SYCL Rate (MF/s) | CUDA Avg time (s) | SYCL Avg time (s) |
|---|---|---|---|---|
| 1 | 12375.8 | 12367.4 | 0.180636 | 0.18076 |
| 2 | 49495.2 | 49439.3 | 0.0451665 | 0.0452175 |
| 4 | 197629 | 197099 | 0.0113117 | 0.0113421 |
| 8 | 487579 | 575118 | 0.00458494 | 0.00388707 |
| 16 | 571705 | 696173 | 0.00391027 | 0.00321116 |
| 32 | 478201 | 684394 | 0.00467486 | 0.00326643 |
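To make the relationship between the rate and average-time columns explicit, here is a minimal sketch of how the rates can be recomputed from the times. It assumes the usual PRK flop count for a star stencil, and a stencil radius of 4, which is not stated in this thread but is the value that reproduces the reported numbers. The function names are hypothetical helpers, not part of the PRK sources.

```cpp
#include <cassert>
#include <cstddef>

// Flops per iteration of a star stencil, using the PRK-style count
// (assumed here): (2 * stencil_size + 1) flops per interior point, with
// stencil_size = 4 * radius + 1 and (n - 2 * radius)^2 interior points.
double stencil_flops(std::size_t n, std::size_t radius) {
    assert(n > 2 * radius);
    const double stencil_size = 4.0 * radius + 1.0;
    const double active = static_cast<double>(n - 2 * radius);
    return (2.0 * stencil_size + 1.0) * active * active;
}

// Rate in MF/s for a given average time per iteration (seconds).
double rate_mflops(std::size_t n, std::size_t radius, double avg_time_s) {
    assert(avg_time_s > 0.0);
    return 1.0e-6 * stencil_flops(n, radius) / avg_time_s;
}

// How many times longer the CUDA iteration takes than the SYCL one.
double time_ratio(double cuda_avg_s, double sycl_avg_s) {
    assert(sycl_avg_s > 0.0);
    return cuda_avg_s / sycl_avg_s;
}
```

Under these assumptions, `rate_mflops(8000, 4, 0.180636)` comes out at roughly 12376 MF/s, in line with the first CUDA row, and `time_ratio(0.00391027, 0.00321116)` is about 1.22, meaning CUDA takes roughly 22% longer per iteration at block size 16.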

Although the difference is not as bad as in @jeffhammond's results, which were obtained on a DGX-A100, CUDA is still noticeably slower than SYCL on both platforms once the block size reaches 8. Here are the simple build commands I used:

(cuda 11.1)
nvcc -g -O3 -std=c++17 --gpu-architecture=sm_70 -D_X86INTRIN_H_INCLUDED stencil-cuda.cu -o stencil-cuda

(clang 13.0.0, intel/llvm commit f126512)
clang++ -g -O3 -std=c++17 -fsycl -fsycl-unnamed-lambda -fsycl-targets=nvptx64-nvidia-cuda-sycldevice stencil-sycl.cc -o stencil-sycl-oneapi

Upon a quick inspection with nvprof, there seems to be no additional overhead outside the two computational kernels (add and star).

Furthermore, in CUDA the add and star kernels each account for roughly half of the average runtime above, but this is not the case for SYCL (at least on V100). While star takes roughly the same time in both, add on CUDA is about 50% slower around the optimal block sizes (i.e. > 8). Considering the atomicity of the add kernel, I reckon this slowdown probably shouldn't be attributed to problematic memory access patterns?
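One way to check the memory-access question is to convert a measured add time into an effective bandwidth. The sketch below assumes that add touches every element of the n x n grid of doubles exactly once for a read and once for a write. That is an assumption about the kernel, not something measured in this thread, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by one pass of an element-wise add over an n x n grid of
// doubles, assuming each element is read once and written once.
double add_bytes(std::size_t n) {
    const double elements = static_cast<double>(n) * static_cast<double>(n);
    return 2.0 * sizeof(double) * elements;
}

// Effective bandwidth in GB/s for a measured kernel time in seconds.
double effective_gb_per_s(std::size_t n, double kernel_time_s) {
    assert(kernel_time_s > 0.0);
    return 1.0e-9 * add_bytes(n) / kernel_time_s;
}
```

For instance, with a purely illustrative add time of 2 ms on the 8000 x 8000 grid, `effective_gb_per_s(8000, 0.002)` gives about 512 GB/s. Plugging in the per-kernel times reported by nvprof for each implementation and comparing against the device's peak memory bandwidth would show how far each add kernel is from being bandwidth-bound.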

It'd be interesting to see whether a similar CUDA slowdown shows up on other kernels. For now I will be looking into the generated PTX; perhaps I can spot the exact instructions that cause this slowdown on CUDA.

wangzy0327 commented 1 year ago

@AtlantaPepsi Hello, I'm trying to run the Cxx11 CUDA and SYCL programs on an x86_64 machine, but I'm unfamiliar with the make.defs config. Could you give me a more precise make.defs config for CUDA and SYCL? Thank you very much!

AtlantaPepsi commented 1 year ago

Hi @wangzy0327, what do you mean by "more precise make.defs config"? Are you getting a build error, or does the executable fail to produce similar results with the existing flags in cuda/oneapi?

wangzy0327 commented 1 year ago

Here is the build error. I have installed the libboost-dev package.

g++-11 -std=gnu++17 -pthread -O3 -mtune=native -ffast-math -Wall  -Wno-ignored-attributes -Wno-deprecated-declarations -DPRKVERSION="2020" stencil-ranges.cc -DUSE_BOOST_IRANGE -I/usr/include/boost/ -DUSE_RANGES -o stencil-ranges
In file included from stencil-ranges.cc:66:
stencil_ranges.hpp: In function ‘void star1(int, prk::vector<double>&, prk::vector<double>&)’:
stencil_ranges.hpp:2:16: error: ‘ranges’ has not been declared