chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Improve performance of RA #9625

Open ronawho opened 6 years ago

ronawho commented 6 years ago

RA performance tracks the non-bucketed reference MPI version and actually starts to outscale it at 256 locales:

ra-perf

Given that we seem to be out-scaling the reference, I'm not sure there are any immediate next steps here at modest locale counts.

I'll also note that we perform better if we oversubscribe RA, i.e. run more tasks than cores so that communication can overlap with computation:

ra-oversub

I think we should update the benchmarks to run with and without oversubscription.

We did see some possible regressions at higher locale counts (512/1024) that we should look into: https://github.com/chapel-lang/chapel/issues/9435

TODOs:

bradcray commented 5 years ago

@ronawho: Checking to see whether this epic can be closed or updated w.r.t. recent advances.

ronawho commented 5 years ago

Performance of RA-atomics improved significantly in 1.18, by up to 45% with no code changes and by up to 6x when switching to buffered atomics.

Given that bale-histo (which is a very similar benchmark to RA) performance is on par with UPC/SHMEM, I think we've done all we can to optimize the atomic version of RA.
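For readers unfamiliar with the idea, here is a rough Python sketch of why buffering helps an RA-style update loop: updates are queued per destination and applied in bulk, so many small remote operations become a few large ones. This is only an illustration, not Chapel's runtime implementation. The names `BufferedXorTable`, `xor_buffered`, `flush_all`, and `run_ra`, the seed value of 1, and the modelling of "locales" as contiguous slices of one list are all assumptions made for the example. Because XOR is commutative, deferring the updates does not change the final table.

```python
# Illustrative sketch only: "locales" are contiguous slices of one Python list,
# and a "flush" stands in for one bulk transfer to a destination locale.
POLY = 0x7
MASK64 = (1 << 64) - 1


def next_random(r):
    """Advance an HPCC RandomAccess-style 64-bit shift-register generator."""
    high_bit = r >> 63
    return ((r << 1) & MASK64) ^ (POLY if high_bit else 0)


class BufferedXorTable:
    def __init__(self, table_size, num_locales, buf_size):
        # Assumes table_size is a power of two and divisible by num_locales.
        self.table = list(range(table_size))
        self.block = table_size // num_locales
        self.buf_size = buf_size
        self.buffers = [[] for _ in range(num_locales)]
        self.flushes = 0  # number of simulated bulk transfers

    def xor_buffered(self, idx, val):
        """Queue an XOR update; flush the destination buffer once it is full."""
        dest = idx // self.block
        self.buffers[dest].append((idx, val))
        if len(self.buffers[dest]) >= self.buf_size:
            self._flush(dest)

    def _flush(self, dest):
        if not self.buffers[dest]:
            return
        for idx, val in self.buffers[dest]:
            self.table[idx] ^= val
        self.buffers[dest].clear()
        self.flushes += 1

    def flush_all(self):
        """Apply every pending update; must be called before reading results."""
        for dest in range(len(self.buffers)):
            self._flush(dest)


def run_ra(table_size, num_updates, num_locales=4, buf_size=64):
    t = BufferedXorTable(table_size, num_locales, buf_size)
    r = 1  # hypothetical seed; the real benchmark derives per-task start values
    for _ in range(num_updates):
        r = next_random(r)
        t.xor_buffered(r & (table_size - 1), r)
    t.flush_all()
    return t
```

With `buf_size=64`, 4096 updates need at most 68 simulated transfers instead of 4096 individual ones, which is the kind of saving the buffered version is after. The explicit `flush_all()` is the price: results are only visible once pending updates have been applied.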

ronawho commented 5 years ago

> @ronawho: Checking to see whether this epic can be closed or updated w.r.t. recent advances.

Updated with recent perf results. I think the last thing I'd like to do here before closing is to add nightly oversubscribed performance testing (i.e. run with -sdataParTasksPerLocale=here.maxTaskPar*2)