Open rahulb1218 opened 3 weeks ago
Should the first loop be adding to array_view_copy instead of array_view?
Is this the code that was used to benchmark the slowdown or is this a reproducer based on some other code where you saw the slowdown?
Have you tried running a kernel before the first kernel to capture first kernel overheads?
Have you tried running the kernels multiple times to look at how having a warm cache affects performance?
@rahulb1218 can you post the performance numbers? Also, I just realized it might be worth looking at the kernels through the NVIDIA profiler. As @MrBurmark pointed out, RAJA has a high initial kernel launch overhead due to the way streams are set up.
Sure, I can report what Caliper recorded. RAJA_View_Kernel: 17.02 seconds. No_View_Kernel: 2.45 seconds.
@rahulb1218, I don't have access to Pascal, but on Lassen things check out fine. Could you try the following example and share what you see? https://github.com/LLNL/RAJA/pull/1728 I added an empty kernel to avoid measuring stream initialization.
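For reference, the suggestions above (run a throwaway kernel first, then time several repetitions) could be sketched roughly as follows. This is plain C++ and not the code from the linked PR; `time_kernel` is a hypothetical helper, not part of RAJA or Caliper, and it assumes the callable blocks until its work is complete (for a GPU launch, that means synchronizing inside the callable).

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a RAJA or Caliper API).
// Runs `kernel` `warmups` times without timing, so one-time costs such as
// stream setup are paid up front, then times `repeats` further runs and
// returns each duration in seconds.
// Assumption: `kernel()` only returns once its work has finished.
template <typename Kernel>
std::vector<double> time_kernel(Kernel&& kernel,
                                std::size_t warmups,
                                std::size_t repeats)
{
  // Untimed warm-up runs.
  for (std::size_t i = 0; i < warmups; ++i) {
    kernel();
  }

  std::vector<double> seconds;
  seconds.reserve(repeats);

  // Timed runs, one sample per repetition.
  for (std::size_t i = 0; i < repeats; ++i) {
    const auto start = std::chrono::steady_clock::now();
    kernel();
    const auto stop = std::chrono::steady_clock::now();
    seconds.push_back(std::chrono::duration<double>(stop - start).count());
  }
  return seconds;
}
```

Wrapping each of the two kernels in a lambda and passing it to this helper with, say, one warm-up and several repeats would show whether the gap persists once first-launch overhead is excluded and the cache is warm.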
Describe the bug
A slowdown is observed with RAJA View compared to accessing memory directly.
To Reproduce
Expected behavior
Both kernels should take about the same time to run.
Compilers & Libraries (please complete the following information):