Reduce privatization overheads

ronawho commented 2 years ago

https://github.com/chapel-lang/chapel/pull/15049 significantly improved the speed of creating distributed domains/arrays by allowing bulk transfer for types used as part of privatization. However in https://github.com/chapel-lang/chapel/issues/20164 we've been seeing that there are still some non-trivial overheads for privatization that we'd like to reduce to speed up the creation of arrays (and in particular small arrays) in Arkouda.

https://github.com/chapel-lang/chapel/issues/20167 is motivated by wanting to cache distributed domains and https://github.com/chapel-lang/chapel/issues/20164 touches on how to efficiently store distributed arrays in classes, but even if both of those were addressed we'll still at some point have to create the initial distributed domain/distribution to cache and create distributed arrays so I think speeding them up is worthwhile.

Note that we are considering getting rid of privatization in favor of remote-value-forwarding in https://github.com/Cray/chapel-private/issues/2805, but that seems far-ish off and I think spending a little time optimizing privatization will give us a good baseline to evaluate the RVF approach.

See https://github.com/Cray/chapel-private/issues/521 and https://github.com/chapel-lang/chapel/issues/14132, which have some old numbers and experiments for privatization. See also https://github.com/Cray/chapel-private/issues/503 about abandoning the binary tree used in privatizing domains/array.

I'd probably approach this by creating a simple comm/perf test to explore what's involved in creating a block distribution, block domain, and block array. Then see if we can replace the binary-tree used for privatization with a standard coforall loc in Locales and arrange for the privatization payload to just get send along with the on-stmt bundle.

The other related thing here is that we're seeing reprivatization occur more often than I'd expect, so beyond just making privatization faster it may be worth exploring if there's code paths we can optimize to avoid reprivatizing.

ronawho commented 2 years ago

After chatting with Ben/Engin I think the right approach is still to switch the binary tree to coforall loc in Locales do on loc and see if we can RVF any more data. There are some things we can't RVF because it's an array of wide class pointers (the thing optimized in https://github.com/chapel-lang/chapel/pull/15049) but we have comm for some other things.

https://github.com/chapel-lang/chapel/pull/20241 switches from the tree-based privatization to the coforall loc in Locales do on loc. I believe this is correct, and is passing testing with the exception of a lot of comm count tests that need updating.

https://github.com/ronawho/chapel/compare/abandon-privatization-tree...ronawho:chapel:opt-block-creation-comms is my attempt to reduce extra comm in block dist/dom/arr creation and privatization. I think it's roughly right, but I have not tested (just wanted to capture the current status before going on vacation.) I also haven't done any extensive perf testing on this. It helps a litlte, but most of our time is just spent doing the remote task creation now, so eliminating a few GETs doesn't have a major impact.

I think the main thing remaining is to look at avoiding reprivatization for domains. Currently we create an empty domain (and privatize it) then assign the local domain into it, which ends up resulting in 2 reprivatization calls (and a call to setup(). If we could arrange to just set up the distributed domain to the final state when creating it, we could eliminate 3 coforall+ons, which would have a non-trivial impact.

If we do that we'll effectively just have:

6 coforall+ons with a few GETs to create and privatize the dist, domain, and array
6 coforalls+ons to delete privatization data and free the dist, domain, and array

And I think this is about as optimal as we could get the current strategy. Maybe with a heroic effort we could fuse loops or something, but I don't think that's worthwhile to pursue currently. Longer term maybe lazy privatization or RVF would also help, but those are also larger efforts and this issue is largely focused on optimizing our current privatization strategy.

e-kayrakli commented 2 years ago

I think the main thing remaining is to look at avoiding reprivatization for domains. Currently we create an empty domain (and privatize it) then assign the local domain into it, which ends up resulting in 2 reprivatization calls (and a call to setup(). If we could arrange to just set up the distributed domain to the final state when creating it, we could eliminate 3 coforall+ons, which would have a non-trivial impact.

Trying to remember the core reason why we have double reprivatization, I can't come up with anything. I'd try with commenting this one out and seeing if it causes any trouble: https://github.com/chapel-lang/chapel/blob/main/modules/internal/ChapelDistribution.chpl#L1202-L1204

It looks obviously redundant to me, but I don't think that's the case. I am just not coming up with a good idea as to why we need that one. It also makes sense philosophically that any class/module that implements these types should not call any privatization routines themselves. They should always be called from the wrapper records' methods.

In any case, here's what happens today:

This function is called when you have a {1..n} dmapped Block(..)
Because we don't have proper copy initializers, something like this will
- create distDom
- privatize it
- then assign dom to it
Following the domain assignment path, operator = will call dsiAssignDomain
BlockDom.dsiAssignDomain will use the common helper
Which will do the 1st reprivatization, that I think we should remove
But popping from the call stack, BlockDom.dsiAssignDomain, will do the 2nd privatization

Going back to my branch, what I attempted was to remove the 2nd privatization instead. I really don't recall why I didn't remove the 1st one. It either didn't occur to me, or I bumped into a challenge. The way it looks I tried to do that was by adding an optional argument to dsiAssignDomain interface to avoid the privatization there. Note that that branch has some other stuff in it.

ronawho commented 2 years ago

During the last perf meeting, we discussed just directly creating the domain instead of creating an empty one and then assigning. At least for rectangular domains. This is prototyped in https://github.com/chapel-lang/chapel/commit/c0b1107d2cf1994199464639ffd6727103739b95

We were wondering why this wasn't done originally, but the code is from 2010 when a lot of distribution work was either being added or redone, so it's hard tell what things were done intentionally vs were just good enough for the time. Digging a little, I believe that before https://github.com/chapel-lang/chapel/commit/397be4144ed687eccd982f4d51304cf182db4898 this change wouldn't actually have reduced the amount of communication and the dsi API change made in 397be41 is required to get the comm reduction.

ronawho commented 2 years ago

We switched away from tree-based privatization in https://github.com/chapel-lang/chapel/pull/20241 and started directly creating distributed domains in https://github.com/chapel-lang/chapel/pull/21038. I believe privatization is about as fast as we can get it in the current scheme and believe future work is captured in issues like https://github.com/Cray/chapel-private/issues/2805

chapel-lang / chapel

Reduce privatization overheads #20197