Closed fwyzard closed 2 years ago
@waredjeb could you double-check that this recovers the performance of ./alpaka --cuda and brings it back to within a few percent of that of ./cuda?
Looks like it recovers the performance of ./alpaka --cuda!
The upstream PR (https://github.com/alpaka-group/alpaka/pull/1782) has been merged into alpaka, so I've cleaned up these changes.
With the latest changes, the performance of ./alpaka --cuda is back to ~97% of that of native ./cuda:
fwyzard@devfu-c2b04-44-01.cms:/data/user/fwyzard/pixeltrack-standalone$ CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 10000 &> /dev/null; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 10000 | grep throughput; done
Processed 10000 events in 9.324276e+00 seconds, throughput 1072.47 events/s, CPU usage per thread: 67.0%
Processed 10000 events in 9.348212e+00 seconds, throughput 1069.72 events/s, CPU usage per thread: 66.7%
Processed 10000 events in 9.426962e+00 seconds, throughput 1060.79 events/s, CPU usage per thread: 66.0%
Processed 10000 events in 9.440113e+00 seconds, throughput 1059.31 events/s, CPU usage per thread: 66.1%
vs
fwyzard@devfu-c2b04-44-01.cms:/data/user/fwyzard/pixeltrack-standalone$ CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 10000 &> /dev/null; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 10000 | grep throughput; done
Processed 10000 events in 9.610436e+00 seconds, throughput 1040.54 events/s, CPU usage per thread: 66.3%
Processed 10000 events in 9.681004e+00 seconds, throughput 1032.95 events/s, CPU usage per thread: 65.9%
Processed 10000 events in 9.711554e+00 seconds, throughput 1029.7 events/s, CPU usage per thread: 66.1%
Processed 10000 events in 9.698732e+00 seconds, throughput 1031.06 events/s, CPU usage per thread: 66.0%
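As a quick sanity check of the ~97% figure, the throughput values printed by the two sets of runs can be averaged and compared. This is only a minimal sketch: the numbers are copied by hand from the output above, and `relative_throughput` is an illustrative helper, not something that exists in pixeltrack-standalone.

```python
from statistics import mean

# Throughput in events/s, copied from the four measured runs of each binary.
cuda_throughput = [1072.47, 1069.72, 1060.79, 1059.31]
alpaka_throughput = [1040.54, 1032.95, 1029.7, 1031.06]


def relative_throughput(candidate, reference):
    """Ratio of mean throughputs (hypothetical helper for this comparison)."""
    return mean(candidate) / mean(reference)


ratio = relative_throughput(alpaka_throughput, cuda_throughput)
print(f"./alpaka --cuda reaches {ratio:.1%} of native ./cuda")
```

With only four runs per configuration and a run-to-run spread of about 1%, the ratio is best read as "roughly 97%" rather than as a precise measurement.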
Add non-cached, pinned host buffers, and use them instead of pageable memory to recover performance close to that of native CUDA.