cms-patatrack / pixeltrack-standalone

Standalone Patatrack pixel tracking
Apache License 2.0

[alpaka] Use alpaka memory fences #320

Closed tonydp03 closed 2 years ago

tonydp03 commented 2 years ago

This small PR adds the new alpaka memory fence functionality (introduced in alpaka v0.8.0) to AlpakaCore. It was validated with the serial and CUDA backends, and the throughput remains unchanged.

fwyzard commented 2 years ago

@tonydp03 rather than simplifying the definition of cms::alpakatools::threadfence, can you replace it everywhere with the corresponding call to alpaka::mem_fence ?

Also, can you comment on the choice of alpaka::memory_scope::Device{} ?

tonydp03 commented 2 years ago

Yes, I was not sure whether it was better to delete the threadfence library and just use alpaka::mem_fence directly where needed. About the choice: memory fences in Alpaka can be issued at the block or device level through alpaka::memory_scope::Block and alpaka::memory_scope::Device. In pixeltrack-standalone, memory fences were applied by means of std::atomic_thread_fence for TBB (the alpaka::mem_fence specialization for CPU does not depend on the memory_scope selected) and __threadfence for CUDA (which is issued at the device level according to the CUDA documentation).
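
For readers less familiar with what such a fence guarantees on the CPU side, here is a minimal sketch in standard C++ only (no alpaka, no CUDA). The names `payload`, `ready`, `producer`, `consumer` and `run_handoff` are hypothetical and do not come from pixeltrack-standalone; the sketch only illustrates how a pair of std::atomic_thread_fence calls orders a plain write against a relaxed flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example data: a plain (non-atomic) payload and a flag.
int payload = 0;
std::atomic<bool> ready{false};

// Writer: fill the payload, issue a release fence, then raise the flag.
void producer() {
  payload = 42;
  std::atomic_thread_fence(std::memory_order_release);
  ready.store(true, std::memory_order_relaxed);
}

// Reader: wait for the flag, issue an acquire fence, then read the payload.
int consumer() {
  while (!ready.load(std::memory_order_relaxed)) {
    std::this_thread::yield();
  }
  std::atomic_thread_fence(std::memory_order_acquire);
  return payload;
}

// Run one writer and one reader and return what the reader observed.
int run_handoff() {
  payload = 0;
  ready.store(false, std::memory_order_relaxed);
  int seen = -1;
  std::thread writer(producer);
  std::thread reader([&seen] { seen = consumer(); });
  writer.join();
  reader.join();
  return seen;
}
```

Without the two fences, the relaxed flag alone would not guarantee that the reader sees the updated payload. On CUDA, __threadfence plays a comparable ordering role at the device level.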

fwyzard commented 2 years ago

@tonydp03 thanks for the latest update.

I've repeated the measurements on a Tesla T4 with an AMD EPYC 75F3 32-core processor.

For alpaka --cuda I didn't see any significant difference, the throughput is always within ~0.1% -- which is expected, since both versions end up calling __threadfence.

For alpaka --serial I start seeing a measurable difference, of the order of 2% with 12 threads and 5% with 20 threads:

| threads | events | alpaka --serial (master) | alpaka --serial (#320) | relative throughput |
|---|---|---|---|---|
| 2 | 20000 | 80.5 ± 0.5 ev/s | 80.1 ± 1.2 ev/s | 99.5% ± 2.0% |
| 4 | 20000 | 149.3 ± 0.9 ev/s | 151.3 ± 0.3 ev/s | 101.3% ± 0.8% |
| 6 | 20000 | 225.2 ± 5.4 ev/s | 223.4 ± 12.1 ev/s | 99.2% ± 7.8% |
| 8 | 20000 | 299.6 ± 1.1 ev/s | 297.8 ± 1.0 ev/s | 99.4% ± 0.7% |
| 10 | 20000 | 371.0 ± 6.4 ev/s | 368.1 ± 2.0 ev/s | 99.2% ± 2.3% |
| 12 | 20000 | 444.0 ± 0.5 ev/s | 436.1 ± 2.2 ev/s | 98.2% ± 0.6% |
| 14 | 20000 | 522.1 ± 0.9 ev/s | 509.6 ± 1.7 ev/s | 97.6% ± 0.5% |
| 16 | 20000 | 590.2 ± 2.5 ev/s | 572.5 ± 3.0 ev/s | 97.0% ± 0.9% |
| 18 | 20000 | 667.8 ± 2.9 ev/s | 638.8 ± 1.1 ev/s | 95.7% ± 0.6% |
| 20 | 20000 | 734.5 ± 1.0 ev/s | 695.3 ± 3.0 ev/s | 94.7% ± 0.5% |
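
The thread does not state how the uncertainty on the relative throughput was obtained. The quoted values look consistent with taking the ratio of the two throughputs and adding their relative uncertainties linearly, so the sketch below assumes exactly that. The `Measurement` struct and the `relative_throughput` function are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type: a throughput in ev/s with its uncertainty.
struct Measurement {
  double value;
  double error;
};

// Ratio of the PR throughput to the master throughput.
// Assumption: the relative uncertainties of the two inputs are added
// linearly (not in quadrature) to get the uncertainty on the ratio.
Measurement relative_throughput(Measurement pr, Measurement master) {
  double ratio = pr.value / master.value;
  double relative_error = pr.error / pr.value + master.error / master.value;
  return {ratio, ratio * relative_error};
}
```

For the 20-thread measurements, `relative_throughput({695.3, 3.0}, {734.5, 1.0})` gives a ratio of about 0.947 with an uncertainty of about 0.005, in line with the quoted 94.7% ± 0.5%.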

Comparing the code, the current implementation of cms::alpakatools::threadfence() evaluates to nothing in the CPU serial case, while alpaka::mem_fence(acc, alpaka::memory_scope::Device{}) does call std::atomic_thread_fence(std::memory_order_acq_rel).

I suspect that we may need an additional case alongside memory_scope::Block and memory_scope::Device, something like memory_scope::Grid.

Could you discuss the idea with the alpaka developers ?

fwyzard commented 2 years ago

Also, it might be a good idea to use the same naming scheme for the atomic operations and memory fences.

Between the two, I prefer the one used by the memory fences, I find the one used by the atomic operations quite confusing :-/

tonydp03 commented 2 years ago

I see: at the block level alpaka::mem_fence does nothing for the CPU serial backend, but at the device level it was designed to call std::atomic_thread_fence, to allow synchronization with other serial kernels. I think that the addition of memory_scope::Grid makes sense. I'll discuss it with the other developers, along with the naming scheme.
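
The scope-dependent behaviour discussed here can be sketched with plain tag dispatch in standard C++. This is not alpaka's implementation: the `sketch` namespace, the `AccCpuSerial` tag and the boolean return value are hypothetical, and treating the grid scope as a no-op is an assumption that holds only if a single thread executes the whole grid.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

  // Hypothetical scope tags mirroring the names used in the discussion.
  namespace memory_scope {
    struct Block {};
    struct Grid {};
    struct Device {};
  }  // namespace memory_scope

  // Hypothetical tag for a backend that runs each grid on a single thread.
  struct AccCpuSerial {};

  // Each overload returns true if a real fence was issued (illustration only).

  // Block scope: a block holds one thread, so nothing needs ordering.
  inline bool mem_fence(AccCpuSerial const&, memory_scope::Block) { return false; }

  // Grid scope (assumption): one thread runs the whole grid, so also a no-op.
  inline bool mem_fence(AccCpuSerial const&, memory_scope::Grid) { return false; }

  // Device scope: other kernels may run concurrently, so issue a real fence.
  inline bool mem_fence(AccCpuSerial const&, memory_scope::Device) {
    std::atomic_thread_fence(std::memory_order_acq_rel);
    return true;
  }

}  // namespace sketch
```

With this layout, code that only needs ordering within its own grid does not pay for a fence on the serial backend, while a device-scope request still gets one.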

fwyzard commented 2 years ago

@tonydp03 I've updated your branch to use the grid-wise memory fences introduced in alpaka-group/alpaka#1641 .

fwyzard commented 2 years ago

And here is the new comparison:

| threads | events | alpaka --serial (master) | alpaka --serial (grid-wise) | relative throughput |
|---|---|---|---|---|
| 2 | 20000 | 80.9 ± 0.5 ev/s | 80.6 ± 0.1 ev/s | 99.5% ± 0.6% |
| 4 | 20000 | 150.5 ± 0.6 ev/s | 149.5 ± 0.4 ev/s | 99.3% ± 0.5% |
| 6 | 20000 | 228.5 ± 4.5 ev/s | 230.7 ± 0.8 ev/s | 100.9% ± 2.0% |
| 8 | 20000 | 301.0 ± 1.1 ev/s | 301.5 ± 0.8 ev/s | 100.1% ± 0.5% |
| 10 | 20000 | 372.9 ± 10.0 ev/s | 373.1 ± 6.5 ev/s | 100.1% ± 3.2% |
| 12 | 20000 | 446.5 ± 1.1 ev/s | 446.1 ± 1.1 ev/s | 99.9% ± 0.4% |
| 14 | 20000 | 525.2 ± 2.5 ev/s | 520.0 ± 7.8 ev/s | 99.0% ± 1.6% |
| 16 | 20000 | 594.1 ± 2.1 ev/s | 594.7 ± 0.5 ev/s | 100.1% ± 0.4% |
| 18 | 20000 | 671.6 ± 1.6 ev/s | 659.3 ± 6.1 ev/s | 98.2% ± 0.9% |
| 20 | 20000 | 740.8 ± 1.7 ev/s | 739.4 ± 2.2 ev/s | 99.8% ± 0.4% |

tonydp03 commented 2 years ago

Perfect. The performance of alpaka --serial now seems the same as for serial, within an acceptable margin.