Open makortel opened 2 years ago
Mhm, this doesn't seem to be working as intended, at least in my test on a GTX 1080 Ti:
taskset -c 0-15 ./kokkos --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 20000
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 2.791611e+01 seconds, throughput 716.432 events/s, CPU usage per thread: 62.2%
taskset -c 0-15 ./kokkos --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 20000
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 2.793876e+01 seconds, throughput 715.851 events/s, CPU usage per thread: 62.4%
taskset -c 0-15 ./kokkos --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 20000
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 4.445962e+01 seconds, throughput 449.846 events/s, CPU usage per thread: 61.1%
Though I'm still at my first coffee, so I cannot guarantee there weren't any mistakes...
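For anyone comparing these numbers by hand, here is a minimal Python sketch that parses the summary lines printed above and recomputes throughput as events divided by wall-clock seconds. The helper names (`parse_summary`, `recomputed_throughput`, `relative_throughput`) are made up for illustration and are not part of pixeltrack-standalone. The regular expression assumes the exact summary format shown in this thread.

```python
import re

# Matches the summary line printed at the end of a run, as shown in this thread.
SUMMARY_RE = re.compile(
    r"Processed (\d+) events in ([0-9.eE+-]+) seconds, "
    r"throughput ([0-9.eE+-]+) events/s, "
    r"CPU usage per thread: ([0-9.]+)%"
)


def parse_summary(line):
    """Extract events, seconds, reported throughput and CPU usage from one line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not a summary line: " + line)
    return {
        "events": int(match.group(1)),
        "seconds": float(match.group(2)),
        "throughput": float(match.group(3)),
        "cpu_percent": float(match.group(4)),
    }


def recomputed_throughput(summary):
    """Throughput in events/s, recomputed from the event count and elapsed time."""
    return summary["events"] / summary["seconds"]


def relative_throughput(summary, baseline):
    """Ratio of the reported throughput to that of a baseline run."""
    return summary["throughput"] / baseline["throughput"]


# First and third summary lines from the GTX 1080 Ti runs above.
first = parse_summary(
    "Processed 20000 events in 2.791611e+01 seconds, throughput 716.432 "
    "events/s, CPU usage per thread: 62.2%"
)
third = parse_summary(
    "Processed 20000 events in 4.445962e+01 seconds, throughput 449.846 "
    "events/s, CPU usage per thread: 61.1%"
)

print(f"recomputed: {recomputed_throughput(first):.3f} events/s")
print(f"third / first: {relative_throughput(third, first):.3f}")
```

With these inputs the third run comes out at roughly 63% of the throughput of the first one.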
I ran similar tests (1 thread with 1k events, and 16 threads with 20k events) on
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 3.812683e+00 seconds, throughput 262.282 events/s, CPU usage per thread: 113.2%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 4.519340e+01 seconds, throughput 442.543 events/s, CPU usage per thread: 51.1%
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 1.846951e+00 seconds, throughput 541.433 events/s, CPU usage per thread: 122.8%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.564639e+01 seconds, throughput 1278.25 events/s, CPU usage per thread: 72.0%
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 1.848977e+00 seconds, throughput 540.84 events/s, CPU usage per thread: 117.9%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.646719e+01 seconds, throughput 1214.54 events/s, CPU usage per thread: 72.9%
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 7.080183e+00 seconds, throughput 141.239 events/s, CPU usage per thread: 96.4%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.020225e+02 seconds, throughput 196.035 events/s, CPU usage per thread: 54.9%
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 6.535374e+00 seconds, throughput 153.013 events/s, CPU usage per thread: 89.3%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.004504e+02 seconds, throughput 199.103 events/s, CPU usage per thread: 59.1%
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 6.933106e+00 seconds, throughput 144.235 events/s, CPU usage per thread: 87.1%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.102847e+02 seconds, throughput 181.349 events/s, CPU usage per thread: 58.9%
Earlier I had tested only on an RTX 2080 and was happy that #333 gave performance similar to #309. But my GTX 1050 Ti test reproduces the GTX 1080 result in https://github.com/cms-patatrack/pixeltrack-standalone/pull/333#issuecomment-1062629038, so this appears to be a real effect.
Here is a plot on V100 (~2 minutes of running for each point, all on the same CoriGPU node)
So on Volta both fixes work, but disabling the "new atomics" yields a slightly higher throughput for >= 3 concurrent events. Perhaps it would be best to go with #309 for now, rebase this PR on top of that, and leave it open for the time being.
Rebased following the merge of #309.
Better workaround than #309, see https://github.com/kokkos/kokkos/issues/4780 for more details.
In addition, this PR adds support for using Kokkos' profiling tools via the KOKKOS_PROFILE_LIBRARY environment variable (functionality that we were missing because of the heavily customized initialization of Kokkos).
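As a rough illustration of how the variable might be used, here is a small Python sketch that builds an environment with KOKKOS_PROFILE_LIBRARY set and launches the same command line used in the tests above. The helper names and the tool library path are hypothetical: point the path at whichever Kokkos Tools shared library you have built. Whether a given Kokkos version honors this variable name is something to check in its documentation.

```python
import os
import subprocess

# Hypothetical location of a Kokkos Tools shared library; adjust to your build.
EXAMPLE_TOOL_LIBRARY = "/path/to/kokkos-tools/kp_kernel_timer.so"


def profiled_env(tool_library, base_env=None):
    """Return a copy of the environment with KOKKOS_PROFILE_LIBRARY set."""
    env = dict(os.environ if base_env is None else base_env)
    env["KOKKOS_PROFILE_LIBRARY"] = tool_library
    return env


def build_command(threads, streams, max_events, executable="./kokkos"):
    """Build the command line used in this thread, without the taskset prefix."""
    return [
        executable,
        "--cuda",
        "--numberOfThreads", str(threads),
        "--numberOfStreams", str(streams),
        "--maxEvents", str(max_events),
    ]


def run_profiled(tool_library, threads=1, streams=1, max_events=1000):
    """Run the benchmark with the profiling tool loaded; raises on failure."""
    return subprocess.run(
        build_command(threads, streams, max_events),
        env=profiled_env(tool_library),
        check=True,
    )
```

Calling `run_profiled(EXAMPLE_TOOL_LIBRARY)` from the directory containing the `kokkos` executable would then run 1000 events on one thread with the tool attached. A single-threaded run seems like the easier starting point for reading profiler output.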