Closed by rupanshusoi 8 months ago
This problem did not reproduce on 2 nodes. Here is a profile. Note it has three equal plateaus of GPU utilization, as expected.
Rupanshu and I noticed that the 4-node profile has a very strange mapping during the third wrapper task that results in excessive copies. This appears to be what is slowing that part of the program down.
Enabling index launches fixed the mapping, which in turn fixed this issue.
I'm running a modified version of Stencil on Perlmutter. I'm seeing a weird slowdown; here is a profile.
In this configuration, there are two wrapper tasks, and the first one executes twice. There are consequently two plateaus of GPU utilization due to the first wrapper task in the profile; those are fine.
The issue is that they should have been followed by another plateau corresponding to the second wrapper task. Instead, GPU utilization drops markedly in the second wrapper task: the gap between successive GPU kernel invocations increases from 2 ms (in the first wrapper task) to about 40 ms (in the second). The utility processors and channel are not overloaded, so I don't know what is causing this slowdown.
Note that the three spikes in CPU utilization are just the copies; they are expected.
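For anyone who wants to quantify the kernel gaps described above rather than eyeball them in the profile, here is a minimal sketch. It assumes you have already extracted (start, end) times for each GPU kernel invocation by some means; the helper names `kernel_gaps` and `mean` and the sample timings are made up for illustration and are not taken from the actual profile or from any Legion tool.

```python
def kernel_gaps(intervals):
    """Return the idle gaps between successive kernel invocations.

    intervals: iterable of (start, end) times, all in the same unit (e.g. ms).
    Overlapping kernels are reported as a gap of 0.
    """
    ordered = sorted(intervals)
    return [
        max(0.0, next_start - prev_end)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical timings in ms, chosen only to mirror the 2 ms vs. 40 ms gaps.
first_wrapper = [(0.0, 5.0), (7.0, 12.0), (14.0, 19.0)]
second_wrapper = [(100.0, 105.0), (145.0, 150.0), (190.0, 195.0)]

print(mean(kernel_gaps(first_wrapper)))   # 2.0
print(mean(kernel_gaps(second_wrapper)))  # 40.0
```

Comparing the mean gap per wrapper task this way makes it easy to check whether a change (such as a different mapping) actually closes the gap.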