Closed tonydp03 closed 2 years ago
@tonydp03 rather than simplifying the definition of `cms::alpakatools::threadfence`, can you replace it everywhere with the corresponding call to `alpaka::mem_fence`?
Also, can you comment on the choice of `alpaka::memory_scope::Device{}`?
Yes, I was not sure whether it was better to delete the `threadfence` library and just use `alpaka::mem_fence` directly where needed.
About the choice: memory fences in Alpaka can be issued at the block or device level through `alpaka::memory_scope::Block` and `alpaka::memory_scope::Device`. In the pixeltrack standalone, memory fences were applied by means of `std::atomic_thread_fence` for TBB (the `alpaka::mem_fence` specialization for CPU does not depend on the `memory_scope` selected) and `__threadfence` for CUDA (which is issued at the device level according to the CUDA documentation).
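For readers less familiar with what such a fence guarantees on the CPU side, here is a minimal standard C++ sketch of the release/acquire pattern built on `std::atomic_thread_fence`. It is not taken from pixeltrack or alpaka: the names `payload`, `ready`, `produce`, `consume` and `run_fence_demo` are hypothetical, and the code only illustrates the ordering that a fence provides between two CPU threads.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Plain (non-atomic) payload, published through an atomic flag.
// All names in this sketch are hypothetical.
int payload = 0;
std::atomic<bool> ready{false};

// Producer: write the payload, issue a release fence, then raise the flag.
void produce(int value) {
  payload = value;
  std::atomic_thread_fence(std::memory_order_release);
  ready.store(true, std::memory_order_relaxed);
}

// Consumer: spin until the flag is raised, issue an acquire fence,
// then read the payload.
int consume() {
  while (!ready.load(std::memory_order_relaxed)) {
    std::this_thread::yield();
  }
  std::atomic_thread_fence(std::memory_order_acquire);
  return payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_fence_demo() {
  payload = 0;
  ready.store(false, std::memory_order_relaxed);
  int seen = -1;
  std::thread consumer([&seen] { seen = consume(); });
  std::thread producer([] { produce(42); });
  producer.join();
  consumer.join();
  return seen;
}
```

The two fences pair up: once the consumer observes `ready == true`, the write to `payload` that preceded the release fence is guaranteed to be visible after the acquire fence. With a single thread there is nothing to order against, which is why a fence can be skipped in a purely serial backend.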
@tonydp03 thanks for the latest update.
I've repeated the measurements on a Tesla T4 with an AMD EPYC 75F3 32-core processor.
For `alpaka --cuda` I didn't see any significant difference: the throughput is always within ~0.1%, which is expected, since both versions end up calling `__threadfence`.
For `alpaka --serial` I start seeing a measurable difference, of the order of 2% with 12 threads and 5% with 20 threads:
threads | events | alpaka --serial (master) | alpaka --serial (#320) | relative throughput |
---|---|---|---|---|
2 | 20000 | 80.5 ± 0.5 ev/s | 80.1 ± 1.2 ev/s | 99.5% ± 2.0% |
4 | 20000 | 149.3 ± 0.9 ev/s | 151.3 ± 0.3 ev/s | 101.3% ± 0.8% |
6 | 20000 | 225.2 ± 5.4 ev/s | 223.4 ± 12.1 ev/s | 99.2% ± 7.8% |
8 | 20000 | 299.6 ± 1.1 ev/s | 297.8 ± 1.0 ev/s | 99.4% ± 0.7% |
10 | 20000 | 371.0 ± 6.4 ev/s | 368.1 ± 2.0 ev/s | 99.2% ± 2.3% |
12 | 20000 | 444.0 ± 0.5 ev/s | 436.1 ± 2.2 ev/s | 98.2% ± 0.6% |
14 | 20000 | 522.1 ± 0.9 ev/s | 509.6 ± 1.7 ev/s | 97.6% ± 0.5% |
16 | 20000 | 590.2 ± 2.5 ev/s | 572.5 ± 3.0 ev/s | 97.0% ± 0.9% |
18 | 20000 | 667.8 ± 2.9 ev/s | 638.8 ± 1.1 ev/s | 95.7% ± 0.6% |
20 | 20000 | 734.5 ± 1.0 ev/s | 695.3 ± 3.0 ev/s | 94.7% ± 0.5% |
Comparing the code, the current implementation of `cms::alpakatools::threadfence()` evaluates to nothing in the CPU serial case, while `alpaka::mem_fence(acc, alpaka::memory_scope::Device{})` does call `std::atomic_thread_fence(std::memory_order_acq_rel)`.
I suspect that we may need an additional case besides `memory_scope::Block` and `memory_scope::Device`, something like `memory_scope::Grid`:
- `__threadfence` for the CUDA/HIP backend, because they don't provide a grid-wise atomic
- `std::atomic_thread_fence(...)` for the TBB and other parallel CPU backends

Could you discuss the idea with the alpaka developers?
Also, it might be a good idea to use the same naming scheme for the atomic operations and memory fences.
Between the two, I prefer the one used by the memory fences, I find the one used by the atomic operations quite confusing :-/
I see: at the block level `alpaka::mem_fence` does nothing for the CPU serial backend, but at the device level it was designed to call `std::atomic_thread_fence` in case synchronization with other serial kernels is needed. I think that the addition of `memory_scope::Grid` makes sense. I'll discuss it with the other developers, along with the naming scheme.
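The scope-dependent behaviour discussed here can be illustrated with a small tag-dispatch sketch. This is not the alpaka implementation: the `sketch` namespace, the `SerialCpu` and `ParallelCpu` stand-ins, the `hardware_fence` helper and the `fences_issued` counter are all hypothetical, and treating `Grid` as a no-op for the serial stand-in is an assumption made for illustration.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {  // hypothetical names, not the alpaka API

// Memory scopes: Block and Device come from the discussion,
// Grid is the proposed addition.
struct Block {};
struct Grid {};
struct Device {};

// Stand-ins for a serial and a parallel CPU backend.
struct SerialCpu {};
struct ParallelCpu {};

// Counts the fences actually issued, only so the behaviour can be observed.
int fences_issued = 0;

inline void hardware_fence() {
  std::atomic_thread_fence(std::memory_order_acq_rel);
  ++fences_issued;
}

// Serial stand-in: a single thread runs the whole grid, so block- and
// grid-level fences are assumed to be no-ops; only the device-level
// fence is issued.
inline void mem_fence(SerialCpu, Block) {}
inline void mem_fence(SerialCpu, Grid) {}
inline void mem_fence(SerialCpu, Device) { hardware_fence(); }

// Parallel stand-in: the same fence is issued whatever the scope.
template <typename Scope>
inline void mem_fence(ParallelCpu, Scope) {
  hardware_fence();
}

}  // namespace sketch
```

Under these assumptions, a kernel that only needs ordering within its own grid would request the `Grid` scope and pay nothing on the serial stand-in, while still getting a real fence on the parallel one.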
@tonydp03 I've updated your branch to use the grid-wise memory fences introduced in alpaka-group/alpaka#1641.
And here is the new comparison:
threads | events | alpaka --serial (master) | alpaka --serial (grid-wise) | relative throughput |
---|---|---|---|---|
2 | 20000 | 80.9 ev/s ± 0.5 ev/s | 80.6 ev/s ± 0.1 ev/s | 99.5% ± 0.6% |
4 | 20000 | 150.5 ev/s ± 0.6 ev/s | 149.5 ev/s ± 0.4 ev/s | 99.3% ± 0.5% |
6 | 20000 | 228.5 ev/s ± 4.5 ev/s | 230.7 ev/s ± 0.8 ev/s | 100.9% ± 2.0% |
8 | 20000 | 301.0 ev/s ± 1.1 ev/s | 301.5 ev/s ± 0.8 ev/s | 100.1% ± 0.5% |
10 | 20000 | 372.9 ev/s ± 10.0 ev/s | 373.1 ev/s ± 6.5 ev/s | 100.1% ± 3.2% |
12 | 20000 | 446.5 ev/s ± 1.1 ev/s | 446.1 ev/s ± 1.1 ev/s | 99.9% ± 0.4% |
14 | 20000 | 525.2 ev/s ± 2.5 ev/s | 520.0 ev/s ± 7.8 ev/s | 99.0% ± 1.6% |
16 | 20000 | 594.1 ev/s ± 2.1 ev/s | 594.7 ev/s ± 0.5 ev/s | 100.1% ± 0.4% |
18 | 20000 | 671.6 ev/s ± 1.6 ev/s | 659.3 ev/s ± 6.1 ev/s | 98.2% ± 0.9% |
20 | 20000 | 740.8 ev/s ± 1.7 ev/s | 739.4 ev/s ± 2.2 ev/s | 99.8% ± 0.4% |
Perfect. Now the performance of `alpaka --serial` seems the same as for `serial`, within an acceptable margin.
This small PR adds the new alpaka memory fence functionality (introduced in alpaka v0.8.0) to AlpakaCore. Validated with the serial and CUDA backends; the throughput remains unchanged.