c3sr / comm_scope

NUMA-aware multi-CPU multi-GPU data transfer benchmarks
https://github.com/c3sr/scope
Apache License 2.0

use cudaStreamWaitEvent for multi-device duplex transfers #25

Closed cwpearson closed 5 years ago

cwpearson commented 5 years ago

CUDA C Programming Guide §3.2.6.3: cudaEventRecord() will fail if the input event and input stream are associated with different devices, and cudaEventElapsedTime() will fail if the two input events are associated with different devices.

We worked around this by measuring the host wall time between launching both transfers and synchronizing. We could instead time events in a single stream: those events would wrap that stream's transfer as well as a wait for the other stream to finish. Then we would not end up measuring the cost of two stream synchronizes on the host.
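A minimal sketch of that idea, not the code from the commits referenced in this issue. It assumes two GPUs at device indices 0 and 1. The function name `duplex_time_ms`, the `CHECK` macro, the buffer names, and the `start` gate on the second stream are all hypothetical choices made for illustration. Both timing events live on device 0, so `cudaEventRecord()` and `cudaEventElapsedTime()` only ever see events and streams from one device. The cross-device dependency goes through `cudaStreamWaitEvent()`, which is allowed to span devices.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                   cudaGetErrorString(err_));                         \
      std::exit(EXIT_FAILURE);                                        \
    }                                                                 \
  } while (0)

// Times a duplex transfer between device 0 and device 1 using events
// recorded only in device 0's stream. Returns milliseconds, or -1.0f
// if fewer than two devices are visible.
float duplex_time_ms(size_t bytes) {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return -1.0f;

  const int dev0 = 0, dev1 = 1;  // assumption: the two GPUs under test
  void *src0, *dst0, *src1, *dst1;
  cudaStream_t s0, s1;
  cudaEvent_t start, stop, other_done;

  // Device 0 owns the timed stream and both timing events.
  CHECK(cudaSetDevice(dev0));
  CHECK(cudaMalloc(&src0, bytes));
  CHECK(cudaMalloc(&dst0, bytes));
  CHECK(cudaStreamCreate(&s0));
  CHECK(cudaEventCreate(&start));
  CHECK(cudaEventCreate(&stop));

  // Device 1 owns the other stream and an untimed completion event.
  CHECK(cudaSetDevice(dev1));
  CHECK(cudaMalloc(&src1, bytes));
  CHECK(cudaMalloc(&dst1, bytes));
  CHECK(cudaStreamCreate(&s1));
  CHECK(cudaEventCreateWithFlags(&other_done, cudaEventDisableTiming));

  // Record start in s0, and (a choice of this sketch) hold s1 until start.
  CHECK(cudaSetDevice(dev0));
  CHECK(cudaEventRecord(start, s0));
  CHECK(cudaStreamWaitEvent(s1, start, 0));

  // Launch both directions of the transfer.
  CHECK(cudaMemcpyPeerAsync(dst1, dev1, src0, dev0, bytes, s0));  // 0 -> 1
  CHECK(cudaSetDevice(dev1));
  CHECK(cudaMemcpyPeerAsync(dst0, dev0, src1, dev1, bytes, s1));  // 1 -> 0
  CHECK(cudaEventRecord(other_done, s1));  // event and stream both on dev1

  // s0 waits on the device for s1 to finish; only then is stop recorded.
  CHECK(cudaSetDevice(dev0));
  CHECK(cudaStreamWaitEvent(s0, other_done, 0));
  CHECK(cudaEventRecord(stop, s0));

  // One host-side synchronize, then read the device-side elapsed time.
  CHECK(cudaEventSynchronize(stop));
  float ms = 0.0f;
  CHECK(cudaEventElapsedTime(&ms, start, stop));

  CHECK(cudaEventDestroy(start));
  CHECK(cudaEventDestroy(stop));
  CHECK(cudaStreamDestroy(s0));
  CHECK(cudaFree(src0));
  CHECK(cudaFree(dst0));
  CHECK(cudaSetDevice(dev1));
  CHECK(cudaEventDestroy(other_done));
  CHECK(cudaStreamDestroy(s1));
  CHECK(cudaFree(src1));
  CHECK(cudaFree(dst1));
  return ms;
}
```

Calling `duplex_time_ms(64 << 20)` would time a 64 MiB transfer in each direction. The sketch does not call `cudaDeviceEnablePeerAccess()`, so on systems without peer access enabled the copies may be staged through the host. A benchmark would likely enable peer access first where it is supported.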

cwpearson commented 5 years ago

Done for coherence gpu/gpu in 87d69e04e12015e19488be1396ea539cc931caa4

cwpearson commented 5 years ago

Done for prefetch gpu/gpu in 03abea0a47136b1bdeb9f10b182eb120c02480a6

cwpearson commented 5 years ago

Done for zero-copy gpu/gpu in 156d014e7b43d809ec3084be66b2b7a673cebead