cms-patatrack / pixeltrack-standalone

Standalone Patatrack pixel tracking
Apache License 2.0

[alpaka] add non-cached pinned host buffers #374

Closed fwyzard closed 2 years ago

fwyzard commented 2 years ago

Add non-cached, pinned host buffers and use them instead of pageable memory, to recover performance close to that of native CUDA.

fwyzard commented 2 years ago

@waredjeb could you double-check that this recovers the performance of ./alpaka --cuda and brings it back within a few percent of that of ./cuda?

waredjeb commented 2 years ago

Looks like it recovers the performance of ./alpaka --cuda!

fwyzard commented 2 years ago

The upstream PR (https://github.com/alpaka-group/alpaka/pull/1782) has been merged into alpaka, so I've cleaned up these changes.

fwyzard commented 2 years ago

With the latest changes, the performance of ./alpaka --cuda is back to ~97% of that of native ./cuda:

fwyzard@devfu-c2b04-44-01.cms:/data/user/fwyzard/pixeltrack-standalone$ CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 10000 &> /dev/null; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 10000 | grep throughput; done
Processed 10000 events in 9.324276e+00 seconds, throughput 1072.47 events/s, CPU usage per thread: 67.0%
Processed 10000 events in 9.348212e+00 seconds, throughput 1069.72 events/s, CPU usage per thread: 66.7%
Processed 10000 events in 9.426962e+00 seconds, throughput 1060.79 events/s, CPU usage per thread: 66.0%
Processed 10000 events in 9.440113e+00 seconds, throughput 1059.31 events/s, CPU usage per thread: 66.1%

vs

fwyzard@devfu-c2b04-44-01.cms:/data/user/fwyzard/pixeltrack-standalone$ CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 10000 &> /dev/null; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 10000 | grep throughput; done
Processed 10000 events in 9.610436e+00 seconds, throughput 1040.54 events/s, CPU usage per thread: 66.3%
Processed 10000 events in 9.681004e+00 seconds, throughput 1032.95 events/s, CPU usage per thread: 65.9%
Processed 10000 events in 9.711554e+00 seconds, throughput 1029.7 events/s, CPU usage per thread: 66.1%
Processed 10000 events in 9.698732e+00 seconds, throughput 1031.06 events/s, CPU usage per thread: 66.0%
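
As a quick sanity check on the ~97% figure, the ratio can be recomputed from the throughput values printed above. This is only a sketch: the numbers are copied by hand from the four timed runs of each binary, and the helper name `relative_throughput` is made up for illustration, not part of the repository.

```python
from statistics import mean

# Throughput (events/s) copied from the four timed runs quoted above.
native_cuda = [1072.47, 1069.72, 1060.79, 1059.31]  # ./cuda
alpaka_cuda = [1040.54, 1032.95, 1029.70, 1031.06]  # ./alpaka --cuda


def relative_throughput(candidate, baseline):
    """Return the mean throughput of `candidate` as a fraction of `baseline`."""
    return mean(candidate) / mean(baseline)


ratio = relative_throughput(alpaka_cuda, native_cuda)
print(f"native CUDA mean: {mean(native_cuda):.2f} events/s")
print(f"alpaka CUDA mean: {mean(alpaka_cuda):.2f} events/s")
print(f"alpaka / native:  {ratio:.1%}")
```

With these inputs the means come out at roughly 1065.6 and 1033.6 events/s, so the ratio is about 97%, consistent with the statement above. Four runs per binary is a small sample, so the last digit should not be read too closely.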