tkoskela closed 2 years ago
Intel VTune does not support NVidia devices
nvidia nsys is able to profile the SYCL kernel and produces meaningful-looking results. I used it simply by running nsys profile ./main <args>. It produces a .qdrep file which can be opened with NVidia Nsight Systems.
I've copied SYCL and CUDA profiling reports of test2 into ~/rds/rds-dirac-dr004/openqcd/nsys_reports/
After a discussion with Ioannis Zacharoudiou, I profiled the code using the NSight Compute profiler ncu. The command used to profile with ncu was
ncu --set=full -f -o profile --target-processes all ./EXECUTABLE
Overall, ncu worked fairly well with dpcpp code. The main issue is that, since the kernels executed on the GPU are generated from lambda functions in SYCL, the function names are not retained as kernel names. Instead, the profiled kernels have generic names that can be hard to decipher if the code contains multiple kernels. As a workaround, I noticed that the kernels have names ending in instance i, where i is a running index. By profiling each kernel individually, it was possible to match the kernel with the source code lines being executed. To profile a single kernel, I used
ncu --set=full -f -o sycl_nvidia_gpu_7 --target-processes all --kernel-name-base demangled --kernel-name regex:"instance 7" ./EXECUTABLE
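Repeating the single-kernel command by hand for every instance index gets tedious, so the per-kernel profiling step could be scripted. The following is a minimal sketch, not the procedure actually used in this issue: the helper names ncu_cmd_for_instance and ncu_cmds_for_range are hypothetical, the index range is an assumption (the actual first and last instance numbers depend on the code), and ./EXECUTABLE is the same placeholder as above. The helpers only print the command lines, using the same flags as the command above, so they can be inspected before anything is run.

```shell
# Sketch: generate one ncu command line per SYCL kernel instance.
# Helper names and the index range are illustrative assumptions.

# Print the ncu command for kernel instance $1 of executable $2.
# Note: the regex is unanchored, as in the original command, so
# "instance 1" would presumably also match "instance 10" if that many
# kernels exist.
ncu_cmd_for_instance() {
    printf 'ncu --set=full -f -o sycl_nvidia_gpu_%s --target-processes all --kernel-name-base demangled --kernel-name regex:"instance %s" %s\n' "$1" "$1" "$2"
}

# Print the commands for instances $1..$2 (inclusive) of executable $3.
ncu_cmds_for_range() {
    i="$1"
    while [ "$i" -le "$2" ]; do
        ncu_cmd_for_instance "$i" "$3"
        i=$((i + 1))
    done
}
```

For example, ncu_cmds_for_range 0 9 ./EXECUTABLE > profile_all.sh writes ten command lines to a file, which can then be checked and run with bash profile_all.sh. Each run writes its own report file, named after the instance index.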
The CUDA kernels in OpenQCD perform better than the SYCL kernels we generated. We have summarized the preliminary results in table 3. The difference varies between the kernels we studied, from a factor of 2 to a factor of 6. We note that the compute throughput seems consistently lower by roughly a factor of 2, while the memory throughput is comparable in the mulpauli and doe kernels and lower by a factor of 2 in the deo kernel. NSight Compute suggests the kernels are bottlenecked by memory throughput. A significant difference between the memory use of the CUDA and SYCL kernels is that the CUDA kernels use only Global memory, while the SYCL kernels use both Global and Local memory. The roofline analysis gives an arithmetic intensity (AI) of 0.37 for the SYCL code and 1.15 for the CUDA code.
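As a rough sanity check on the roofline numbers: for a memory-bound kernel, the roofline model bounds attainable performance by AI times memory bandwidth, so at equal bandwidth the ratio of the two AI values gives the ratio of the performance bounds. The sketch below only does that arithmetic. It assumes both codes are memory-bound (as NSight Compute suggests) and achieve the same bandwidth, which the deo kernel results above indicate is not always the case. The helper name ai_ratio is hypothetical.

```shell
# Sketch: ratio of two arithmetic intensities (AI), rounded to 2 decimals.
# $1 = AI of the CUDA code, $2 = AI of the SYCL code.
ai_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ai_ratio 1.15 0.37   # prints 3.11
```

Under those assumptions, the CUDA code would have roughly a 3x higher memory-bound performance ceiling, which falls inside the factor of 2 to 6 observed across the kernels.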
ncu reports for reference
Next step after UCL/openqcd-oneapi#16 and UCL/openqcd-oneapi#15 are closed
Possible tools