UCL / openqcd-oneapi

GNU General Public License v2.0

Profile SYCL dpcpp code and compare with CUDA #14

Closed tkoskela closed 2 years ago

tkoskela commented 2 years ago

Next step after UCL/openqcd-oneapi#16 and UCL/openqcd-oneapi#15 are closed

Possible tools

tkoskela commented 2 years ago

Intel VTune does not support NVIDIA devices

tkoskela commented 2 years ago

NVIDIA's `nsys` is able to profile the SYCL kernels and produces meaningful-looking results. I used it simply by running `nsys profile ./main <args>`. It produces a `.qdrep` file which can be opened in NVIDIA Nsight Systems.

I've copied the SYCL and CUDA profiling reports of test2 into `~/rds/rds-dirac-dr004/openqcd/nsys_reports/`

tkoskela commented 2 years ago

After a discussion with Ioannis Zacharoudiou, I profiled the code using the Nsight Compute profiler `ncu`. The command used was

```
ncu --set=full -f -o profile --target-processes all ./EXECUTABLE
```

Overall `ncu` worked fairly well with dpcpp code. The main issue is that, since the kernels executed on the GPU are generated from lambda functions in SYCL, the function names are not retained as kernel names. Instead, the profiled kernels have generic names that can be hard to decipher if the code contains multiple kernels. As a workaround, I noticed that the demangled kernel names end in `instance i`, where `i` is a running index. By profiling each kernel individually, it was possible to match each kernel with the source code lines being executed. To profile a single kernel, I used

```
ncu --set=full -f -o sycl_nvidia_gpu_7 --target-processes all --kernel-name-base demangled --kernel-name regex:"instance 7" ./EXECUTABLE
```

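The per-kernel runs can be scripted. A minimal sketch, assuming the kernels are numbered `instance 1` through `instance 8` (the range is a guess; list the demangled kernel names from a full report first):

```shell
# Sketch: build one ncu invocation per SYCL kernel instance so each report
# can be matched to its source lines. The instance range (1..8) is an
# assumption; read the real indices off a full ncu report.
for i in $(seq 1 8); do
  echo ncu --set=full -f -o "sycl_kernel_${i}" --target-processes all \
       --kernel-name-base demangled --kernel-name "regex:\"instance ${i}\"" \
       ./EXECUTABLE
  # drop the leading 'echo' to actually run each profile
done
```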
tkoskela commented 2 years ago

The CUDA kernels in OpenQCD perform better than the SYCL kernels we generated. We have summarized the preliminary results in table 3. The difference varies between the kernels we studied, from a factor of 2 to a factor of 6. We note that the compute throughput is categorically lower by roughly a factor of 2, while the memory throughput is comparable in the mulpauli and doe kernels and lower by a factor of 2 in the deo kernel. Nsight Compute suggests the kernels are bottlenecked by memory throughput. A significant difference in memory use is that the CUDA kernels use only global memory, while the SYCL kernels use both global and local memory. The roofline analysis gives an arithmetic intensity (AI) of 0.37 for the SYCL code and 1.15 for the CUDA code.
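As a sanity check on the roofline numbers: a kernel is memory-bound when its arithmetic intensity falls below the device's ridge point (peak FLOP rate divided by peak bandwidth). A quick sketch, using the AI values above but hypothetical placeholder peaks (not measured on our device):

```shell
# Toy roofline check. The AI values (flop/byte) are from the ncu roofline
# analysis above; the device peaks below are hypothetical placeholders.
peak_gflops=7000   # GFLOP/s (assumed, not measured)
peak_gbs=900       # GB/s (assumed, not measured)
ridge=$(awk -v f="$peak_gflops" -v b="$peak_gbs" 'BEGIN { printf "%.2f", f / b }')
for entry in sycl:0.37 cuda:1.15; do
  name=${entry%%:*}
  ai=${entry##*:}
  # below the ridge point the kernel cannot reach peak flops: memory-bound
  verdict=$(awk -v a="$ai" -v r="$ridge" \
    'BEGIN { if (a < r) print "memory-bound"; else print "compute-bound" }')
  echo "$name: AI=$ai, ridge=$ridge flop/byte -> $verdict"
done
```

With these placeholder peaks both versions sit well below the ridge point, which is consistent with Nsight Compute flagging memory throughput as the bottleneck.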

Detailed ncu reports for reference

CUDA

cuda_ncu_details

DPCPP

sycl_ncu_details