Closed makortel closed 1 week ago
I'd be curious if the CPU usage from `top` matches the values from https://github.com/cms-patatrack/pixeltrack-standalone/pull/324?
Let me give #324 a try.
I re-ran a subset of the test cases with #324 and the `ps`-based monitoring, and plotted the thread efficiency (i.e. the number reported by #324).
In the majority of the cases #324 gives a slightly larger value, but I'd say overall the two methods are in rather good agreement.
Thanks for the check ! I'll extend #324 to all backends and merge it.
This idea was integrated into CMSSW in https://github.com/cms-sw/cmssw/pull/44901, so I think this PR can be closed.
This PR prototypes `edm::async()` planned in https://github.com/cms-sw/cmssw/issues/29188, and uses it in `cudadev` instead of `cudaStreamAddCallback()`.

I made a preliminary test on an RTX 2080 (a single ~1 minute run for each case; CPU utilization is the peak of the values reported by `ps`, checked every 10 s). I tested three cases, `cudadev`, `cudadev --transfer`, and `cudadev --histogram`, to see if the behavior is different with more device-to-host transfers (i.e. synchronization calls) and CPU work.

I tried first to use `cudaStreamSynchronize()` with each of `cudaDeviceScheduleSpin`, `cudaDeviceScheduleYield`, and `cudaDeviceScheduleBlockingSync` (for `cudaSetDeviceFlags()`), but didn't see any substantial difference between them: they give a bit higher throughput at a low number of threads, but come with significantly higher CPU use compared to `cudaStreamAddCallback()`.

Then I tried `cudaEventSynchronize()`, but in a first test that didn't seem to improve things (I didn't even bother to measure it). Adding `cudaEventBlockingSync` to the CUDA event creation flags finally led to significantly lower CPU utilization, in the case of the minimal-transfer `cudadev` much lower than with `cudaStreamAddCallback()`. The throughput at a low number of CPU threads takes a hit of up to 15 % compared to `cudaStreamAddCallback()` and `cudaStreamSynchronize()` (depending on the test case).

Given that I didn't repeat the measurements, the uncertainty may be high, and therefore I think the measurements should be repeated. I'm planning to do that, but didn't want to hold the prototype for those.
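
For readers unfamiliar with the pattern, here is a minimal sketch in plain C++ of the general idea: a blocking wait is run on a separate helper thread, and a continuation is invoked once it returns, so the calling thread stays free for other CPU work. This is not the `edm::async()` interface from this PR. The names `run_async_sketch`, `fake_device_work`, and `run_demo` are hypothetical, and the device work is simulated with a sleeping `std::async` task rather than real CUDA calls.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper, NOT the edm::async() interface: runs a blocking
// wait on a separate thread and invokes a continuation when it returns.
template <typename Wait, typename Continuation>
std::thread run_async_sketch(Wait blocking_wait, Continuation on_done) {
  return std::thread(
      [wait = std::move(blocking_wait), done = std::move(on_done)]() mutable {
        wait();  // stand-in for a blocking call such as cudaEventSynchronize()
        done();  // stand-in for resuming the work that depends on the result
      });
}

// Assumption for illustration: "device work" is simulated by a task that
// sleeps for `delay`; a real application would enqueue GPU work instead.
std::future<void> fake_device_work(std::chrono::milliseconds delay) {
  return std::async(std::launch::async,
                    [delay] { std::this_thread::sleep_for(delay); });
}

// Launches the simulated work, waits for it on a helper thread, and
// returns the value set by the continuation.
int run_demo() {
  std::future<void> device_done =
      fake_device_work(std::chrono::milliseconds(50));

  std::promise<int> result;
  std::future<int> result_future = result.get_future();

  std::thread helper = run_async_sketch(
      [f = std::move(device_done)]() mutable { f.wait(); },
      [&result] { result.set_value(42); });

  // The calling thread could do unrelated CPU work here; in this sketch it
  // simply blocks until the continuation has run.
  int value = result_future.get();
  helper.join();
  return value;
}
```

Calling `run_demo()` from a `main()` function returns the value set by the continuation after the simulated work finishes. In this sketch the helper thread sleeps inside the wait rather than spinning, which is the behavior the `cudaEventBlockingSync` flag is meant to give for `cudaEventSynchronize()`; a production version would more likely reuse a pool of helper threads than spawn one per wait.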