Improve multithread GPU performance by removing subtle device sync

amandalund commented 2 months ago

@esseivaju found in his profiling that in our "extend from secondaries" action there was a device synchronization:

That’s what I was saying about having to synchronize each step. You can see that in the extend-from-secondaries action::remove-if-alive scope we have to synchronize with the device, so we cannot push the last few kernels for that step, or move on to the next step, before the GPU has finished executing the current step. In the top line you can see a gap of ~1 ms in GPU utilization at the end of each step, i.e. between steps.

@sethrj questioned recently whether the way we are copying the result of a scan to the host might be synchronous, and I think that is the culprit. This adds a little helper class to copy a single value to the host and does the copy of the scan result asynchronously.
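For readers following along, here is a minimal sketch of what such a helper could look like. It is not the class added in this PR: the names `copy_value_to_host` and `check_cuda` are hypothetical, and it assumes the CUDA runtime API (the HIP equivalent would use the corresponding `hip*` calls). The point is only that the copy is enqueued on a specific stream and the host waits on that stream alone, rather than going through the default stream.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical error check, for illustration only
inline void check_cuda(cudaError_t err)
{
    if (err != cudaSuccess)
    {
        throw std::runtime_error(cudaGetErrorString(err));
    }
}

// Hypothetical helper (not the Celeritas implementation): copy a single
// trivially copyable value from device memory to the host on `stream`.
template<class T>
T copy_value_to_host(T const* device_ptr, cudaStream_t stream)
{
    T result{};
    // Enqueue the copy on the given stream instead of the default stream
    check_cuda(cudaMemcpyAsync(
        &result, device_ptr, sizeof(T), cudaMemcpyDeviceToHost, stream));
    // Wait only for work on this stream, not for the whole device
    check_cuda(cudaStreamSynchronize(stream));
    return result;
}
```

Because `result` lives in ordinary pageable memory here, the host still blocks until the value has arrived; the gain is that other streams are not forced to drain first.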

I ran the regression problems with merge_events off; here’s the speedup in throughput with this change relative to develop: [plot: rel-throughput-remove-sync]

esseivaju commented 2 months ago

I'm pretty sure the copy is still synchronous because we're copying back to pageable memory; we'd have to allocate pinned memory for the counters to make the copy truly asynchronous. From your plot, it's still a nice improvement, as the copy is now issued in a stream instead of using the default stream, which synchronizes with the device! 👍
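To illustrate the pinned-memory point: the CUDA runtime documents that a device-to-host `cudaMemcpyAsync` into pageable memory returns only once the copy has completed, whereas a page-locked destination lets the call return while the transfer is still in flight. The sketch below is an assumption about how the counters could be held, not code from this PR; `PinnedValue` and `check_cuda` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical error check, for illustration only
inline void check_cuda(cudaError_t err)
{
    if (err != cudaSuccess)
    {
        throw std::runtime_error(cudaGetErrorString(err));
    }
}

// Hypothetical RAII holder for one value in page-locked (pinned) host memory
template<class T>
class PinnedValue
{
  public:
    PinnedValue()
    {
        void* raw = nullptr;
        check_cuda(cudaMallocHost(&raw, sizeof(T)));
        ptr_ = static_cast<T*>(raw);
    }
    ~PinnedValue() { cudaFreeHost(ptr_); }
    PinnedValue(PinnedValue const&) = delete;
    PinnedValue& operator=(PinnedValue const&) = delete;

    // Enqueue the device-to-host copy; with a pinned destination the call
    // can return before the transfer has finished
    void copy_from_device_async(T const* device_ptr, cudaStream_t stream)
    {
        check_cuda(cudaMemcpyAsync(
            ptr_, device_ptr, sizeof(T), cudaMemcpyDeviceToHost, stream));
    }

    // Only meaningful after the stream (or an event recorded after the
    // copy) has been synchronized
    T const& get() const { return *ptr_; }

  private:
    T* ptr_ = nullptr;
};
```

A caller would enqueue the copy, keep launching independent kernels, and call `cudaStreamSynchronize(stream)` only at the point where the value is actually needed on the host.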

sethrj commented 2 months ago

Good lord, that's a shocking improvement. I'll see what Frontier does...

amandalund commented 2 months ago

Yes, sorry, that's right: not fully asynchronous, but no longer on the default stream.

sethrj commented 2 months ago

@amandalund Frontier shows no performance improvement at all, unfortunately. @esseivaju would you be able to replicate this on Perlmutter?

amandalund commented 2 months ago

@sethrj just making sure, did you update omp_threads in the script (right now it's set to 1 when running on the device)? I guess you should still see improvement with celer-g4 either way though...

sethrj commented 2 months ago

Oh, this was with the default values. So I guess that's only using 1 CPU per GPU so it shouldn't show much speedup. Sorry, I didn't even notice merge_events being off 😅

celeritas-project / celeritas #1405