Closed amandalund closed 2 months ago
I'm pretty sure the copy is still synchronous because we're copying back to pageable memory; we'd have to allocate pinned memory for the counters to make the copy truly asynchronous. From your plot, it's still a nice improvement, since the copy is now issued in a stream instead of the default stream, which synchronizes with the device! 👍
Good lord, that's a shocking improvement. I'll see what Frontier does...
Yes, sorry, that's right: not fully asynchronous, but no longer on the default stream.
@amandalund Frontier shows no performance improvement at all, unfortunately. @esseivaju would you be able to replicate this on Perlmutter?
@sethrj just making sure, did you update `omp_threads` in the script (right now it's set to 1 when running on the device)? I guess you should still see improvement with celer-g4 either way though...
Oh, this was with the default values. So I guess that's only using 1 CPU per GPU, so it shouldn't show much speedup. Sorry, I didn't even notice `merge_events` being off 😅
@esseivaju found in his profiling that in our "extend from secondaries" action there was a device synchronization:
@sethrj questioned recently whether the way we are copying the result of a scan to the host might be synchronous, and I think that is the culprit. This adds a little helper class to copy a single value to the host and does the copy of the scan result asynchronously.
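For anyone unfamiliar with the pattern, here is a minimal sketch of what a single-value copy helper could look like in plain CUDA. It is not the class added in this PR: `SingleValueCopier`, `check`, and `roundtrip` are hypothetical names chosen for illustration, and the pinned staging buffer is an assumption (a copy into pageable host memory can still block the calling thread).

```cuda
#include <cstdio>
#include <cstdlib>
#include <type_traits>
#include <cuda_runtime.h>

// Hypothetical error check: abort with a message on any CUDA failure
inline void check(cudaError_t err)
{
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::abort();
    }
}

// Hypothetical helper (not the PR's class): copy one value from device
// memory to the host on a caller-supplied stream.
template<class T>
class SingleValueCopier
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "value is copied bytewise");

  public:
    explicit SingleValueCopier(cudaStream_t stream) : stream_(stream)
    {
        // Assumption: stage through pinned (page-locked) host memory so the
        // device-to-host copy does not have to block the calling thread
        void* ptr = nullptr;
        check(cudaMallocHost(&ptr, sizeof(T)));
        host_ = static_cast<T*>(ptr);
    }
    ~SingleValueCopier() { cudaFreeHost(host_); }
    SingleValueCopier(SingleValueCopier const&) = delete;
    SingleValueCopier& operator=(SingleValueCopier const&) = delete;

    // Enqueue the copy on the stream and return without waiting for it
    void start(T const* device_ptr)
    {
        check(cudaMemcpyAsync(
            host_, device_ptr, sizeof(T), cudaMemcpyDeviceToHost, stream_));
    }

    // Wait for this stream only (not the whole device), then read the value
    T get()
    {
        check(cudaStreamSynchronize(stream_));
        return *host_;
    }

  private:
    cudaStream_t stream_;
    T* host_{nullptr};
};

// Illustration only: put a value on the device and read it back via the helper
int roundtrip(int value)
{
    cudaStream_t stream;
    check(cudaStreamCreate(&stream));

    void* ptr = nullptr;
    check(cudaMalloc(&ptr, sizeof(int)));
    int* d_value = static_cast<int*>(ptr);
    check(cudaMemcpy(d_value, &value, sizeof(int), cudaMemcpyHostToDevice));

    int result = 0;
    {
        SingleValueCopier<int> copier(stream);
        copier.start(d_value);
        // Independent host work could go here before the value is needed
        result = copier.get();
    }

    check(cudaFree(d_value));
    check(cudaStreamDestroy(stream));
    return result;
}
```

The point of the sketch is that the only wait is `cudaStreamSynchronize` on the stream that issued the copy, so work queued on other streams is not held up the way it would be by a copy on the default stream.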
I ran the regression problems with `merge_events` off; here’s the speedup in the throughput with this change relative to develop: