Open AtlantaPepsi opened 3 years ago
@AtlantaPepsi Hello, I'm trying to run the Cxx11 CUDA and SYCL programs on an x86_64 machine, but I'm unfamiliar with the make.defs configuration. Could you give me more precise make.defs settings for CUDA and SYCL? Thank you very much!
Hi @wangzy0327, what do you mean by "more precise make.defs config"? Are you hitting a build error, or does the executable fail to produce comparable results with the existing flags in the cuda/oneapi configs?
This is the build error. I have the libboost-dev package installed.
g++-11 -std=gnu++17 -pthread -O3 -mtune=native -ffast-math -Wall -Wno-ignored-attributes -Wno-deprecated-declarations -DPRKVERSION="2020" stencil-ranges.cc -DUSE_BOOST_IRANGE -I/usr/include/boost/ -DUSE_RANGES -o stencil-ranges
In file included from stencil-ranges.cc:66:
stencil_ranges.hpp: In function ‘void star1(int, prk::vector<double>&, prk::vector<double>&)’:
stencil_ranges.hpp:2:16: error: ‘ranges’ has not been declared
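For context, here is a minimal, self-contained sketch (my own illustration, not the actual stencil_ranges.hpp) of the two range flavors that the -DUSE_BOOST_IRANGE / -DUSE_RANGES switches select between. Note that std::ranges lives in the C++20 header <ranges>, so a std::ranges code path will not compile under -std=gnu++17, which may be relevant to the error above:

```cpp
// Illustrative only: a Boost.Range irange and a C++20 ranges iota view
// driving the same loop. With -DUSE_BOOST_IRANGE this needs only
// <boost/range/irange.hpp>; with -DUSE_RANGES it needs -std=c++20 or later.
#include <cstdio>

#if defined(USE_BOOST_IRANGE)
#include <boost/range/irange.hpp>
static auto make_range(int lo, int hi) { return boost::irange(lo, hi); }
#elif defined(USE_RANGES)
#include <ranges>
static auto make_range(int lo, int hi) { return std::views::iota(lo, hi); }
#else
#error "define USE_BOOST_IRANGE or USE_RANGES"
#endif

int main(void)
{
    for (auto i : make_range(0, 4)) {
        std::printf("%d\n", i);
    }
    return 0;
}
```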
As observed here, the CUDA stencil operation appears to be much slower than DPC++ across all block sizes on an NVIDIA device. I also ran the problem (8000 grid points, 100 iterations) on a V100, and the results are as follows:
Although the difference is not as bad as in @jeffhammond's results, which were obtained on a DGX-A100, CUDA is still quite a bit slower than SYCL on either platform. Here are the simple build commands I used:
(cuda 11.1)
nvcc -g -O3 -std=c++17 --gpu-architecture=sm_70 -D_X86INTRIN_H_INCLUDED stencil-cuda.cu -o stencil-cuda
(clang 13.0.0, intel/llvm commit f126512)
clang++ -g -O3 -std=c++17 -fsycl -fsycl-unnamed-lambda -fsycl-targets=nvptx64-nvidia-cuda-sycldevice stencil-sycl.cc -o stencil-sycl-oneapi
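(Relating this back to the make.defs question above: as a rough, unverified sketch, those two command lines would map onto a make.defs fragment along these lines. The variable names NVCC, CUDAFLAGS, SYCLCXX and SYCLFLAG are my guess at the conventions used by the templates in common/, so check the make.defs templates shipped in the repo and adjust the paths and GPU architecture for your machine.)

```make
# Hypothetical make.defs fragment -- variable names are guesses, not
# necessarily the repo's canonical ones; the flags mirror the two
# command lines above.
NVCC=nvcc
CUDAFLAGS=-g -O3 -std=c++17 --gpu-architecture=sm_70 -D_X86INTRIN_H_INCLUDED

SYCLCXX=clang++
SYCLFLAG=-g -O3 -std=c++17 -fsycl -fsycl-unnamed-lambda \
         -fsycl-targets=nvptx64-nvidia-cuda-sycldevice
```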
Upon a quick inspection with nvprof, there seems to be no additional overhead outside the two computational kernels (add and star). Furthermore, add and star roughly split the total average runtime above in CUDA, but not in SYCL (at least on the V100). While star takes roughly the same time in both, add on CUDA is about 50% slower around the optimal block size (i.e. > 8). Considering how simple the add kernel is, I reckon this slowdown probably shouldn't be attributed to problematic memory access patterns?
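For reference, the add kernel under discussion is essentially a single element-wise read-modify-write per grid point; a paraphrased sketch (not the exact PRK source) of the CUDA version is below, and the SYCL version has the same body inside a parallel_for over a 2D range:

```cpp
#include <cuda_runtime.h>

// Paraphrased sketch of the stencil "add" kernel: every thread updates one
// element of the n x n grid, with no neighbor accesses and no reductions.
__global__ void add(const int n, double * in)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    if ((i < n) && (j < n)) {
        in[i * n + j] += 1.0;
    }
}
```

Since the per-element work is the same in both models, the access pattern by itself does look like an unlikely culprit.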
It would be interesting to see whether a similar CUDA drawback shows up in other kernels. For now I will look into the PTX binaries; perhaps I can spot the exact instructions that incur this slowdown on CUDA.