[alpaka] Use alpaka memory fences

tonydp03 commented 2 years ago

This small PR adds the new alpaka memory fence functionality (introduced in alpaka v0.8.0) to AlpakaCore. It was validated with the serial and CUDA backends, and the throughput remains unchanged.

fwyzard commented 2 years ago

@tonydp03 rather than simplifying the definition of cms::alpakatools::threadfence, can you replace it everywhere with the corresponding call to alpaka::mem_fence ?

Also, can you comment on the choice of alpaka::memory_scope::Device{} ?

tonydp03 commented 2 years ago

Yes, I was not sure whether it was better to delete the threadfence library and just use alpaka::mem_fence directly where needed. As for the choice: memory fences in alpaka can be issued at the block or device level through alpaka::memory_scope::Block and alpaka::memory_scope::Device. In pixeltrack-standalone, memory fences were applied by means of std::atomic_thread_fence for TBB (the alpaka::mem_fence specialization for CPU does not depend on the selected memory_scope) and __threadfence for CUDA (which acts at the device level according to the CUDA documentation).
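For readers less familiar with what std::atomic_thread_fence provides on the CPU side, here is a minimal standalone sketch of the usual publish/consume pattern ordered only by fences. The function name publish_and_consume is hypothetical and is not part of alpaka or pixeltrack-standalone; on a GPU the analogous ordering role would be played by __threadfence().

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper for illustration only (not alpaka or pixeltrack code).
// A producer writes a plain variable and raises a relaxed flag; a consumer
// waits for the flag and reads the variable. The two fences provide the
// ordering that the relaxed flag accesses do not provide by themselves.
inline int publish_and_consume(int value) {
  int payload = 0;                 // plain, non-atomic data
  std::atomic<bool> ready{false};  // flag accessed with relaxed ordering
  int observed = -1;

  std::thread producer([&] {
    payload = value;
    // Make the write to payload visible before the flag store below.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
  });

  std::thread consumer([&] {
    while (!ready.load(std::memory_order_relaxed)) {
      std::this_thread::yield();
    }
    // Pairs with the release fence in the producer.
    std::atomic_thread_fence(std::memory_order_acquire);
    observed = payload;
  });

  producer.join();
  consumer.join();
  return observed;
}
```

Calling publish_and_consume(42) returns the value written by the producer thread. Without the two fences, the read of payload would be a data race.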

fwyzard commented 2 years ago

@tonydp03 thanks for the latest update.

I've repeated the measurements on a Tesla T4 with an AMD EPYC 75F3 32-core processor.

For alpaka --cuda I didn't see any significant difference, the throughput is always within ~0.1% -- which is expected, since both versions end up calling __threadfence.

For alpaka --serial I start seeing a measurable difference, of the order of 2% with 12 threads and 5% with 20 threads:

threads	events	alpaka --serial (master)	alpaka --serial (#320)	relative throughput
2	20000	80.5 ± 0.5 ev/s	80.1 ± 1.2 ev/s	99.5% ± 2.0%
4	20000	149.3 ± 0.9 ev/s	151.3 ± 0.3 ev/s	101.3% ± 0.8%
6	20000	225.2 ± 5.4 ev/s	223.4 ± 12.1 ev/s	99.2% ± 7.8%
8	20000	299.6 ± 1.1 ev/s	297.8 ± 1.0 ev/s	99.4% ± 0.7%
10	20000	371.0 ± 6.4 ev/s	368.1 ± 2.0 ev/s	99.2% ± 2.3%
12	20000	444.0 ± 0.5 ev/s	436.1 ± 2.2 ev/s	98.2% ± 0.6%
14	20000	522.1 ± 0.9 ev/s	509.6 ± 1.7 ev/s	97.6% ± 0.5%
16	20000	590.2 ± 2.5 ev/s	572.5 ± 3.0 ev/s	97.0% ± 0.9%
18	20000	667.8 ± 2.9 ev/s	638.8 ± 1.1 ev/s	95.7% ± 0.6%
20	20000	734.5 ± 1.0 ev/s	695.3 ± 3.0 ev/s	94.7% ± 0.5%
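As a worked example of the last column, the sketch below computes the ratio of two throughput measurements and its uncertainty. It assumes the relative uncertainties are added linearly, which appears consistent with most rows of the table but is not stated in the thread; the Measurement type and relative_throughput function are hypothetical names.

```cpp
// Hypothetical types and names, for illustration only.
struct Measurement {
  double value;  // throughput in ev/s
  double error;  // absolute uncertainty in ev/s
};

// Ratio candidate/reference. The relative uncertainties are added linearly
// (an assumption: the thread does not say how the last column was computed).
inline Measurement relative_throughput(Measurement reference, Measurement candidate) {
  const double ratio = candidate.value / reference.value;
  const double relative_error =
      candidate.error / candidate.value + reference.error / reference.value;
  return {ratio, ratio * relative_error};
}
```

For the 4-thread row, relative_throughput({149.3, 0.9}, {151.3, 0.3}) gives a ratio of about 1.013 with an uncertainty of about 0.008, in line with the quoted 101.3% ± 0.8%.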

Comparing the code, the current implementation of cms::alpakatools::threadfence() evaluates to nothing in the cpu serial case, while alpaka::mem_fence(acc, alpaka::memory_scope::Device{}) does call std::atomic_thread_fence(std::memory_order_acq_rel).

I suspect that we may need an additional case to memory_scope::Block and memory_scope::Device, something like memory_scope::Grid:

- call __threadfence for the CUDA/HIP backend, because they don't provide a grid-wise atomic
- call std::atomic_thread_fence(...) for the TBB and other parallel CPU backends
- call nothing for the serial CPU backend, because the execution is fully serial at the block and grid level
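The proposal above can be sketched in plain C++ with tag dispatch. All names in the sketch namespace are hypothetical stand-ins and are not alpaka's actual types or signatures; the CUDA/HIP case is only described in a comment because it cannot be expressed in host-only code. Each overload returns whether a fence was actually issued, purely so that the behaviour can be checked.

```cpp
#include <atomic>

// Hypothetical tags and functions for illustration; not the alpaka API.
namespace sketch {
  struct BackendSerial {};  // serial CPU backend
  struct BackendTbb {};     // stand-in for any parallel CPU backend

  struct ScopeBlock {};
  struct ScopeGrid {};      // the proposed additional scope
  struct ScopeDevice {};

  // Generic case: issue a real fence. This covers the parallel CPU backends
  // at every scope and the serial backend at device scope. A CUDA/HIP
  // backend would call __threadfence() for the grid scope instead.
  template <typename Backend, typename Scope>
  bool mem_fence(Backend, Scope) {
    std::atomic_thread_fence(std::memory_order_acq_rel);
    return true;
  }

  // Serial backend, block scope: a single thread per block, nothing to order.
  inline bool mem_fence(BackendSerial, ScopeBlock) { return false; }

  // Serial backend, grid scope: execution is fully serial within the grid,
  // so the fence can be skipped.
  inline bool mem_fence(BackendSerial, ScopeGrid) { return false; }
}  // namespace sketch
```

With these overloads, sketch::mem_fence(sketch::BackendSerial{}, sketch::ScopeGrid{}) compiles to nothing, while the same call with sketch::ScopeDevice{} or with sketch::BackendTbb{} still issues std::atomic_thread_fence.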

Could you discuss the idea with the alpaka developers ?

fwyzard commented 2 years ago

Also, it might be a good idea to use the same naming scheme for the atomic operations and memory fences.

Between the two, I prefer the one used by the memory fences; I find the one used by the atomic operations quite confusing :-/

tonydp03 commented 2 years ago

I see. At the block level alpaka::mem_fence does nothing for the CPU serial backend, but at the device level it was designed to call std::atomic_thread_fence, in case synchronization with other serial kernels is needed. I think that the addition of memory_scope::Grid makes sense. I'll discuss it with the other developers, along with the naming scheme.

fwyzard commented 2 years ago

@tonydp03 I've updated your branch to use the grid-wise memory fences introduced in alpaka-group/alpaka#1641 .

fwyzard commented 2 years ago

And here is the new comparison:

threads	events	alpaka --serial (master)	alpaka --serial (grid-wise)	relative throughput
2	20000	80.9 ev/s ± 0.5 ev/s	80.6 ev/s ± 0.1 ev/s	99.5% ± 0.6%
4	20000	150.5 ev/s ± 0.6 ev/s	149.5 ev/s ± 0.4 ev/s	99.3% ± 0.5%
6	20000	228.5 ev/s ± 4.5 ev/s	230.7 ev/s ± 0.8 ev/s	100.9% ± 2.0%
8	20000	301.0 ev/s ± 1.1 ev/s	301.5 ev/s ± 0.8 ev/s	100.1% ± 0.5%
10	20000	372.9 ev/s ± 10.0 ev/s	373.1 ev/s ± 6.5 ev/s	100.1% ± 3.2%
12	20000	446.5 ev/s ± 1.1 ev/s	446.1 ev/s ± 1.1 ev/s	99.9% ± 0.4%
14	20000	525.2 ev/s ± 2.5 ev/s	520.0 ev/s ± 7.8 ev/s	99.0% ± 1.6%
16	20000	594.1 ev/s ± 2.1 ev/s	594.7 ev/s ± 0.5 ev/s	100.1% ± 0.4%
18	20000	671.6 ev/s ± 1.6 ev/s	659.3 ev/s ± 6.1 ev/s	98.2% ± 0.9%
20	20000	740.8 ev/s ± 1.7 ev/s	739.4 ev/s ± 2.2 ev/s	99.8% ± 0.4%

tonydp03 commented 2 years ago

Perfect. Now the performance of alpaka --serial seems the same as for serial, within an acceptable margin.

cms-patatrack / pixeltrack-standalone

[alpaka] Use alpaka memory fences #320