Improve performance of ISx

ronawho commented 6 years ago

ISx scalability pretty closely tracks the reference SHMEM version up to 256 locales, but raw performance is still ~40% behind:

[image: isx-time]

I believe this is partially due to the overhead of full dynamic array registration, and it could also be a result of only using FMA under ugni instead of BTE. It's also possible we have extra comm compared to the reference version.

TODOs:

[ ] Use dynamic heap extension for serial arrays: #9616
[x] Use BTE for large gets/puts under ugni: #9615
[ ] Track comm counts: #9621 (and see what we can do to improve them)

ronawho commented 6 years ago

An experimental branch that uses BTE for puts and forces heap-extensions (https://github.com/ronawho/chapel/tree/isx-perf) has promising performance that's on par with the reference at 256 locales:

[image: isx-train]

bradcray commented 6 years ago

@ronawho: Can this be closed now, or do you want to keep it open to track other potential improvements to ISx?

ronawho commented 6 years ago

I want to keep it open. I think there's more we can do (noinit on local arrays, minimize comm-counts in exchange, etc.). Given that performance is competitive with SHMEM, I don't think it's important to look into those in the near future, but I don't want to lose track of these ideas.

chapel-lang / chapel

Improve performance of ISx #9622