[kokkos] Work around performance issue by using only 'unsigned long' in AtomicPairCounter

makortel commented 2 years ago

Better workaround than #309, see https://github.com/kokkos/kokkos/issues/4780 for more details.

In addition, this PR adds support for Kokkos' profiling tools via the KOKKOS_PROFILE_LIBRARY environment variable, functionality we were missing because of our heavily customized Kokkos initialization.
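As a rough illustration of how the environment variable would be used, here is a minimal sketch that builds the command line and environment for a profiled run. The tool library path is a placeholder (an assumption, not something from this PR), and the helper name `profiling_command` is hypothetical; only the `./kokkos`, `--cuda` and `--maxEvents` pieces are taken from the commands in this thread.

```python
import os


def profiling_command(tool_library, extra_args=()):
    """Return (cmd, env) for running the kokkos binary with a profiling tool.

    tool_library: path to a Kokkos Tools shared library (placeholder here).
    extra_args:   additional command-line arguments for the binary.
    """
    # Copy the current environment and point Kokkos at the tool library.
    env = dict(os.environ)
    env["KOKKOS_PROFILE_LIBRARY"] = tool_library
    # Executable name and flags as they appear in the benchmark commands.
    cmd = ["./kokkos", "--cuda", "--maxEvents", "1000", *extra_args]
    return cmd, env


if __name__ == "__main__":
    # Hypothetical path; replace with a real tool library on your system.
    cmd, env = profiling_command("/path/to/profiling_tool.so")
    print(" ".join(cmd))
    # To actually launch the job, uncomment:
    # import subprocess
    # subprocess.run(cmd, env=env, check=True)
```

Passing the variable through `env` rather than exporting it globally keeps the profiled run separate from unprofiled benchmark runs in the same shell.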

fwyzard commented 2 years ago

Mhm, this doesn't seem to be working as intended, at least in my test on a GTX 1080 Ti:

master

taskset -c 0-15 ./kokkos --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 20000
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 2.791611e+01 seconds, throughput 716.432 events/s, CPU usage per thread: 62.2%

master + #309

taskset -c 0-15 ./kokkos --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 20000
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 2.793876e+01 seconds, throughput 715.851 events/s, CPU usage per thread: 62.4%

master + #333

taskset -c 0-15 ./kokkos --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 20000
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 4.445962e+01 seconds, throughput 449.846 events/s, CPU usage per thread: 61.1%

Though I'm still at my first coffee, so I cannot guarantee there weren't any mistakes...

makortel commented 2 years ago

I ran similar tests (1 thread with 1k events, and 16 threads with 20k events) on two GPUs.

RTX 2080 SUPER

master

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 3.812683e+00 seconds, throughput 262.282 events/s, CPU usage per thread: 113.2%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 4.519340e+01 seconds, throughput 442.543 events/s, CPU usage per thread: 51.1%

master + #309

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 1.846951e+00 seconds, throughput 541.433 events/s, CPU usage per thread: 122.8%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.564639e+01 seconds, throughput 1278.25 events/s, CPU usage per thread: 72.0%

master + #333

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 1.848977e+00 seconds, throughput 540.84 events/s, CPU usage per thread: 117.9%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.646719e+01 seconds, throughput 1214.54 events/s, CPU usage per thread: 72.9%

GTX 1050 Ti

master

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 7.080183e+00 seconds, throughput 141.239 events/s, CPU usage per thread: 96.4%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.020225e+02 seconds, throughput 196.035 events/s, CPU usage per thread: 54.9%

master + #309

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 6.535374e+00 seconds, throughput 153.013 events/s, CPU usage per thread: 89.3%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.004504e+02 seconds, throughput 199.103 events/s, CPU usage per thread: 59.1%

master + #333

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 6.933106e+00 seconds, throughput 144.235 events/s, CPU usage per thread: 87.1%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.102847e+02 seconds, throughput 181.349 events/s, CPU usage per thread: 58.9%

Earlier I had tested only on the RTX 2080 SUPER and was happy that #333 gave performance similar to #309. But my GTX 1050 Ti test reproduces the GTX 1080 Ti result in https://github.com/cms-patatrack/pixeltrack-standalone/pull/333#issuecomment-1062629038, so this appears to be a real effect.
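To make the comparison easier to read, the 16-thread throughputs above can be expressed as a change relative to master. This is only a sketch: the numbers are copied from the logs in this thread, while the dictionary layout and the function names (`relative_change`, `summarize`) are made up for illustration.

```python
# 16-thread, 20000-event throughput (events/s) copied from the logs above.
THROUGHPUT = {
    "GTX 1080 Ti": {"master": 716.432, "#309": 715.851, "#333": 449.846},
    "RTX 2080 SUPER": {"master": 442.543, "#309": 1278.25, "#333": 1214.54},
    "GTX 1050 Ti": {"master": 196.035, "#309": 199.103, "#333": 181.349},
}


def relative_change(results, baseline="master"):
    """Percent change in throughput of each variant relative to the baseline."""
    base = results[baseline]
    return {
        name: 100.0 * (value / base - 1.0)
        for name, value in results.items()
        if name != baseline
    }


def summarize(throughput=THROUGHPUT):
    """One summary line per GPU, e.g. 'GPU: #309: +x.x%, #333: -y.y%'."""
    lines = []
    for gpu, results in throughput.items():
        changes = relative_change(results)
        parts = ", ".join(f"{name}: {pct:+.1f}%" for name, pct in changes.items())
        lines.append(f"{gpu}: {parts}")
    return lines


if __name__ == "__main__":
    print("\n".join(summarize()))
```

On these numbers, #333 is roughly 37% slower than master on the GTX 1080 Ti and about 7% slower on the GTX 1050 Ti, while on the RTX 2080 SUPER both #309 and #333 come out well above master.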

makortel commented 2 years ago

Here is a plot on V100 (~2 minutes of running for each point, all on the same CoriGPU node):

[Figure: kokkos_cuda_throughput]

So on Volta both fixes work, but disabling the "new atomics" yields a bit higher throughput for >= 3 concurrent events. Perhaps it would be best to go with #309 for now, rebase this PR on top of that, and leave it open for the time being.

makortel commented 2 years ago

Rebased following the merge of #309.
