cms-patatrack / pixeltrack-standalone

Standalone Patatrack pixel tracking
Apache License 2.0

[cudadev][RFC] Prototype edm::async() and replace cudaStreamAddCallback with it #321

Closed makortel closed 1 week ago

makortel commented 2 years ago

This PR prototypes edm::async(), planned in https://github.com/cms-sw/cmssw/issues/29188, and uses it in cudadev instead of cudaStreamAddCallback().
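To make the idea concrete, here is a minimal sketch of the general pattern: a blocking wait is moved to a separate thread, which then triggers a continuation, instead of registering a callback on the CUDA stream. It uses only the C++ standard library. The names `asyncWait` and `demoAsyncWait` are hypothetical, the sleep merely stands in for a blocking call such as cudaEventSynchronize(), and none of this is the actual edm::async() interface.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper (NOT the real edm::async() signature):
// runs `blockingWait` on a separate thread, then invokes `onDone`.
std::thread asyncWait(std::function<void()> blockingWait, std::function<void()> onDone) {
  return std::thread([wait = std::move(blockingWait), done = std::move(onDone)]() {
    wait();  // stands in for a blocking call such as cudaEventSynchronize()
    done();  // stands in for resuming the framework task
  });
}

// Small demonstration: the continuation fulfils a promise with a value.
int demoAsyncWait() {
  std::promise<int> result;
  std::future<int> fut = result.get_future();
  assert(fut.valid());

  std::thread worker = asyncWait(
      [] { std::this_thread::sleep_for(std::chrono::milliseconds(50)); },  // fake device wait
      [&result] { result.set_value(42); });                                // continuation

  // The calling thread is free to do other work here while the worker blocks.
  int value = fut.get();  // wait until the continuation has run
  worker.join();
  return value;
}
```

Calling `demoAsyncWait()` returns 42 once the simulated wait has finished. A real implementation would presumably reuse threads rather than spawn one per wait; that choice is left out of this sketch.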

I made a preliminary test on an RTX 2080 (a single ~1 minute run for each case; the CPU utilization is the peak of the values reported by ps, checked every 10 s). I tested three cases, cudadev, cudadev --transfer, and cudadev --histogram, to see if the behavior is different with more device-to-host transfers (i.e. synchronization calls) and CPU work.

I first tried cudaStreamSynchronize() with each of cudaDeviceScheduleSpin, cudaDeviceScheduleYield, and cudaDeviceScheduleBlockingSync (for cudaSetDeviceFlags()), but didn't see any substantial difference between them: they give a bit higher throughput at low numbers of threads, but come with significantly higher CPU use than cudaStreamAddCallback().

Then I tried cudaEventSynchronize(), but in a first test that didn't seem to improve things (I didn't even bother to measure it). Adding cudaEventBlockingSync to the CUDA event creation flags finally led to significantly lower CPU utilization, in the case of minimal-transfer cudadev much lower than with cudaStreamAddCallback(). The throughput at low numbers of CPU threads takes up to a 15% hit compared to cudaStreamAddCallback() and cudaStreamSynchronize() (depending on the test case).
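As a rough host-only analogy for the trade-off between spinning and blocking synchronization, the sketch below waits for a signal from another thread either by busy-polling an atomic flag or by sleeping on a condition variable, and reports the process CPU time spent. It does not call CUDA and does not reproduce the measurements above. The names `Signal` and `cpuSecondsWhileWaiting` are hypothetical, and it assumes a POSIX-like system where `std::clock()` returns process CPU time.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <ctime>
#include <mutex>
#include <thread>

// Hypothetical stand-in for "work completed on the device".
struct Signal {
  std::mutex m;
  std::condition_variable cv;
  std::atomic<bool> done{false};
};

// Returns the process CPU seconds used while waiting ~200 ms for the signal.
// Assumption: std::clock() measures process CPU time (true on POSIX systems).
double cpuSecondsWhileWaiting(bool spin) {
  Signal s;
  std::clock_t start = std::clock();

  // Producer thread: sleeps, then raises the flag and notifies the waiter.
  std::thread producer([&s] {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    {
      std::lock_guard<std::mutex> lock(s.m);
      s.done.store(true);
    }
    s.cv.notify_one();
  });

  if (spin) {
    while (!s.done.load()) {
      // busy-poll: low wake-up latency, but keeps a CPU core occupied
    }
  } else {
    std::unique_lock<std::mutex> lock(s.m);
    s.cv.wait(lock, [&s] { return s.done.load(); });  // thread sleeps until notified
  }

  producer.join();
  return static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing `cpuSecondsWhileWaiting(true)` with `cpuSecondsWhileWaiting(false)` should show the spinning variant consuming roughly the full wait time in CPU, while the blocking variant consumes close to none, at the cost of a possibly later wake-up.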


Given that I didn't repeat the measurements, the uncertainty may be high, so I think the measurements should be repeated. I'm planning to do that, but didn't want to hold up the prototype for them.

fwyzard commented 2 years ago

I'd be curious whether the CPU usage from top matches the values from https://github.com/cms-patatrack/pixeltrack-standalone/pull/324?

makortel commented 2 years ago

Let me give #324 a try.

makortel commented 2 years ago

I re-ran a subset of the test cases with #324 and the ps-based monitoring, and plotted the thread efficiency (i.e. the number reported by #324).

In the majority of the cases #324 gives a slightly larger value, but I'd say overall the two methods are in rather good agreement.

fwyzard commented 2 years ago

Thanks for the check! I'll extend #324 to all backends and merge it.

makortel commented 1 week ago

This idea was integrated into CMSSW in https://github.com/cms-sw/cmssw/pull/44901, so I think this PR can be closed.