Improve performance of RA

chapel-lang / chapel

a Productive Parallel Programming Language

https://chapel-lang.org

Improve performance of RA #9625

Open ronawho opened 6 years ago

ronawho commented 6 years ago

RA performance tracks the non-bucketed reference MPI version and actually starts to outscale it at 256 locales:

ra-perf

Given that we seem to be out-scaling the reference, I'm not sure there are any immediate next steps here at modest locale counts.

I'll also note that we perform better if we oversubscribe RA (which lets us overlap comm/compute):

ra-oversub

I think we should update the benchmarks to run with and without oversubscription.

We did see some possible regressions at higher locale counts (512/1024) that we should look into: https://github.com/chapel-lang/chapel/issues/9435

TODOs:

[ ] add oversubscribed testing
[x] investigate possible regression: #9435

bradcray commented 5 years ago

@ronawho: Checking to see whether this epic can be closed or updated w.r.t. recent advances.

ronawho commented 5 years ago

Performance of RA-atomics improved significantly in 1.18, by up to 45% with no code changes and by up to 6x when switching to buffered atomics:

Improvements to RA-atomics from #9876:
Ability to use buffered atomics for RA-atomics from #10702:

Given that bale-histo (which is a very similar benchmark to RA) performance is on par with UPC/SHMEM, I think we've done all we can to optimize the atomic version of RA.
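For readers unfamiliar with why buffering helps here, below is a rough conceptual model in Python, not Chapel and not how the Chapel runtime implements buffered atomics. It assumes a block-distributed table, counts every update as one network message in the unbuffered case, and uses a hypothetical `run_ra` helper with a hypothetical `buffer_size` parameter. Since XOR is commutative and associative, deferring and batching updates per destination leaves the final table unchanged while cutting the message count.

```python
import random


def run_ra(num_locales, table_size, updates, buffer_size=None):
    """Apply XOR updates to a table block-distributed over num_locales.

    Returns (table, messages). Hypothetical model: table_size is assumed
    to be a power of two and divisible by num_locales, and every update
    is treated as remote for simplicity.
    """
    per_locale = table_size // num_locales
    table = [0] * table_size
    messages = 0
    # One pending-update buffer per destination locale.
    buffers = [[] for _ in range(num_locales)]

    def flush(dest):
        nonlocal messages
        if buffers[dest]:
            # A whole batch travels as a single message.
            for idx, val in buffers[dest]:
                table[idx] ^= val
            buffers[dest].clear()
            messages += 1

    for r in updates:
        idx = r & (table_size - 1)   # low bits select the table slot
        dest = idx // per_locale     # owning locale of that slot
        if buffer_size is None:
            table[idx] ^= r          # unbuffered: one message per update
            messages += 1
        else:
            buffers[dest].append((idx, r))
            if len(buffers[dest]) >= buffer_size:
                flush(dest)

    # Drain whatever is still pending at the end.
    for dest in range(num_locales):
        flush(dest)
    return table, messages


rng = random.Random(0)
updates = [rng.getrandbits(64) for _ in range(4096)]
plain, plain_msgs = run_ra(4, 1024, updates)
buffered, buffered_msgs = run_ra(4, 1024, updates, buffer_size=64)
print(plain == buffered, plain_msgs, buffered_msgs)
```

The model only illustrates the message-count argument; it says nothing about the measured speedups above, which depend on the network and runtime.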

We also improved ra-rmo performance in https://github.com/chapel-lang/chapel/pull/11176:

ronawho commented 5 years ago

> @ronawho: Checking to see whether this epic can be closed or updated w.r.t. recent advances.

Updated with recent perf results. I think the last thing I'd like to do here before closing is to add nightly oversubscribed performance testing (i.e. run with -sdataParTasksPerLocale=here.maxTaskPar*2).