Open ronawho opened 6 years ago
@ronawho: Checking to see whether this epic can be closed or updated w.r.t. recent advances.
Performance of RA-atomics was significantly improved in 1.18. Up to 45% with no code changes, and up to 6x when switching to buffered atomics:
Improvements to RA-atomics from #9876:
Ability to use buffered atomics for RA-atomics from #10702:
Given that bale-histo (which is a very similar benchmark to RA) performance is on par with UPC/SHMEM, I think we've done all we can to optimize the atomic version of RA.
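For readers unfamiliar with why buffering helps so much here, the following is a conceptual toy model in Python. It is not Chapel code and not how the Chapel runtime implements buffered atomics. It only illustrates the general idea: batching remote updates per destination locale and sending each batch as one message. Every name in it (`BufferedUpdater`, `run`, the table size, the index pattern) is hypothetical.

```python
from collections import defaultdict


class BufferedUpdater:
    """Toy model: batch remote increments per destination locale."""

    def __init__(self, table_size, num_locales, buf_size):
        self.table = [0] * table_size
        # Entries owned by each simulated locale (ceiling division).
        self.block = -(-table_size // num_locales)
        self.buf_size = buf_size
        self.buffers = defaultdict(list)
        self.messages = 0  # count of simulated network messages

    def _owner(self, idx):
        return idx // self.block

    def add(self, idx, val=1):
        """Queue an update; send the batch once the buffer is full."""
        loc = self._owner(idx)
        self.buffers[loc].append((idx, val))
        if len(self.buffers[loc]) >= self.buf_size:
            self._flush_one(loc)

    def _flush_one(self, loc):
        buf = self.buffers[loc]
        if buf:
            self.messages += 1  # one message carries the whole batch
            for idx, val in buf:
                self.table[idx] += val
            buf.clear()

    def flush(self):
        """Send all partially filled buffers."""
        for loc in list(self.buffers):
            self._flush_one(loc)


def run(buf_size, n_updates=1000, table_size=64, num_locales=4):
    """Apply n_updates increments; return (messages sent, sum of table)."""
    up = BufferedUpdater(table_size, num_locales, buf_size)
    for i in range(n_updates):
        up.add((i * 37) % table_size)
    up.flush()
    return up.messages, sum(up.table)
```

With these made-up parameters, `run(1)` sends one simulated message per update (1000 in total), while `run(100)` applies the same 1000 updates with 12 messages. The trade-off in this model is that updates are not visible until a buffer fills or `flush()` is called.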
Updated with recent perf results. I think the last thing I'd like to do here before closing is to add nightly oversubscribed performance testing (i.e. run with -sdataParTasksPerLocale=here.maxTaskPar*2).
RA performance tracks the non-bucketed reference MPI version and actually starts to out-scale it at 256 locales:
Given that we seem to be out-scaling the reference, I'm not sure if there are any immediate next steps here at modest locale counts.
I'll also note that we perform better if we oversubscribe RA (can overlap comm/compute):
I think we should update the benchmarks to run with and without oversubscription.
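As a rough illustration of why oversubscription can help, here is a back-of-the-envelope latency-hiding model in Python. It is an assumption-laden sketch, not a measurement and not derived from the RA code. It assumes each update costs `compute` time units on a core plus `wait` units of communication latency during which the core is free to run another task. The function name and parameters are hypothetical.

```python
def time_per_core(updates, compute, wait, tasks_per_core):
    """Estimated time for one core to finish `updates` operations.

    Toy model: with enough tasks per core, one task's communication
    wait is hidden behind another task's compute, so core utilization
    rises until it saturates at 1.0.
    """
    utilization = min(1.0, tasks_per_core * compute / (compute + wait))
    return updates * compute / utilization


# Equal compute and wait costs: one task per core leaves the core idle
# half the time, two tasks per core keep it busy in this model.
baseline = time_per_core(1000, compute=1.0, wait=1.0, tasks_per_core=1)
oversubscribed = time_per_core(1000, compute=1.0, wait=1.0, tasks_per_core=2)
```

In this model `baseline` is 2000.0 and `oversubscribed` is 1000.0. Real runs also pay for task switching and contention, which the model ignores, so this is one more reason to test both with and without oversubscription.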
We did see some possible regressions at higher locale counts (512/1024) that we should look into: https://github.com/chapel-lang/chapel/issues/9435
TODOs: