Remote by-value copies of arrays use remote domain causing communication bottleneck

LouisJenkinsCS commented 6 years ago

One idiom that I believe should be considered standard is the localizing of small arrays allocated on a single locale. Currently, as domains are shared whenever you make a copy of an array, it results in all accesses of that array being remote which kind of destroys the purpose of localizing arrays. For example, the below example fails as _arr's domain is allocated on locale 1. As the domain is used for more complex things than querying the size, having all accesses translate into remote PUT/GET operations can be a severe bottleneck and even eliminate any perceived performance advantage you would obtain from localizing it to begin with.

 var arr : [1..10] int;

 on Locales[1] {
   var _arr = arr;
   var sz : int;
   local do sz = _arr.domain.size;
 }

One way around this is to manually make a copy of the domain and then perform an array assignment.

 var arr : [1..10] int;

 on Locales[1] {
   const _dom = arr.domain;
   var _arr : [_dom] int = arr;
   var sz : int;
   local do sz = _arr.domain.size;
 }

I believe that if an array copy is created on another locale it should make a copy of the original domain. In real code, you could easily be performing a dual coforall and end up with multiple round-trips which sink performance without even knowing the cause.

bradcray commented 6 years ago

I believe that the original code should be optimized by having the implementation recognize that the domain of arr is constant and unchanging, permitting the compiler / modules to create a local copy of the domain on locale 1. If arr's domain is not constant, I don't believe this implicit localization of the domain should occur. This is one of several optimizations we've discussed implementing for constant domains (tagging @ronawho and @benharsh, as they've run into similar bottlenecks in various codes over time).

As the domain is used for more complex things than querying the size, having all accesses translate into remote PUT/GET operations can be a severe bottleneck and even eliminate any perceived performance advantage you would obtain from localizing it to begin with.

My impression is that most array operations don't access the domain (at least, with --no-checks / --fast enabled). Are you finding this not to be the case?

LouisJenkinsCS commented 6 years ago

For certain things that query the domain, such as for a binary search or sort, it will access the domain many many times.

bradcray commented 6 years ago

OK, thanks. But no core (non-library) array features that you've encountered?

LouisJenkinsCS commented 6 years ago

I'm not sure, the only way to test this kind of stuff would be to run verbose CommDiagnostics and I only do so for performance-critical code. However, as a potential guideline, I'd say any array features that make heavy use of the domain would end up, with an exception for special vector-operations like push_back, would suffer from this issue.

Now that I think about it there was an issue where the field ranges in DefaultRectangularArray.chpl became entirely remote, meaning anytime it is used caused a large number of round-trips, especially if the range itself had some internal logic. Doing a 'ctrl + f' for ranges shows just how many round-trips it causes. Can't show you an MWE yet though, but I can put it on my TODO list if you're too busy to get around to it/request it.

bradcray commented 6 years ago

I'm not sure...

OK, that's a good sign if you're not aware of problem cases already.

I only do so for performance-critical code.

In a similar vein, I'm more motivated to spend effort fixing things that are actually causing users problems than those that might potentially cause somebody a problem someday (since there are an arbitrary large number of issues in the latter category).

w.r.t. the cases we've discussed:

I'm not seeing many references to ranges in DefaultRectangularArray.chpl within the array code itself—most of them appear in the domain code. For those cases, as noted above, I think that it'd be worthwhile to have the compiler remote-value forward constant domains since this would eliminate communication for queries on the domain itself (like your size query, and likely most of the other routines that access ranges in DefaultRectangularArray.chpl).
push_back and friends can only be applied when there's a 1:1 relationship between an array and its domain, so wouldn't apply in a case like this since there is a second array sharing the same domain.
In the case of search and sort, I'd argue that the simplest fix would be to improve the library routines themselves to avoid so many queries to the domain (but if I'm understanding you correctly, you're not actually getting bitten by this, but are anticipating that others might be someday?)

Can't show you an MWE yet though, but I can put it on my TODO list if you're too busy to get around to it/request it.

I'm not requesting that you create anything, I'm just trying to understand in what cases this is actually turning into a communication problem for you to determine what the most expedient fix(es) would be and hopefully close this issue (while also checking that my assumptions are correct). It sounds like you're not aware of any problematic core operations on the array, and that most of the issues you're aware of, or are anticipating, relate to queries on the domain (which the example code didn't localize, so isn't surprising; but which the compiler arguably could/should optimize in cases like this).

I believe that if an array copy is created on another locale it should make a copy of the original domain.

To expand on a comment I made above: We try hard to define Chapel's semantics so that a computation will behave similarly whether it spans multiple locales or not, so I'm mentally translating this request into "if any array copy is created, it should make a copy of the original domain." But I don't think that's a direction we'd want to pursue due to the multiple benefits that [can / do] result from sharing domains between arrays. This is what causes me to think that the right way to generally reduce communication in cases like this is to remote value forward the domain when it's constant (or sufficiently constant w.r.t. the on-clause for r.v.f. to fire), and to not do so when it's not (i.e., when the domain may actually be changing, so needs to be referred to remotely).

In the meantime, a way to make your workaround a little more attractive would be to add a helper routine that localized both the array and the domain:

proc localizeArray(X: [?D] ?t) {
  const locD = D;
  var locX: [locD] t = X;
  return locX;
}

where your original code could then be re-written as:

on Locales[1] {
  var _arr = localizeArray(arr);
  var sz : int;
  local do sz = _arr.domain.size;
}

bradcray commented 6 years ago

[I originally posted this out of the intended order due to losing track of all the tabs I had open. Putting it in its proper order here].

Shoot, looking around at related issues, I think that the solution I proposed is a bit more challenging than I had hoped. Specifically, I'm remembering that issue #6995 / PR #7050 already added support for remote-value forwarding of domains and that the reason that this case isn't handled is because the domain is referred to implicitly through the array's .domain method rather than explicitly by name. Specifically, note that while the following "works":

const D = {1..10};
var arr : [D] int;

on Locales[1] {
  var _arr : [D] int = arr;
  var sz : int;
  local do sz = _arr.domain.size;
}

if the D in _arr's declaration is changed to arr.domain then it no longer does. This is because remote value forwarding is based on the symbols explicitly named within an on-clause, not those which might happen to be referred to indirectly through a method call. And I think, in general, that determining what symbols are referred to indirectly in this way is challenging. So then, if I'm not missing anything, the question might become "how much effort should we put into specialized optimization of this array/domain case" given that they're a bit more "built into" the language?

LouisJenkinsCS commented 6 years ago

I didn't realize the problem would be non-trivial. What would you suggest for a workaround where we only have access to the array and not the domain? I,E if it was passed to a function by reference. I'm assuming this case would prevent remote-value forwarding.

bradcray commented 6 years ago

What would you suggest for a workaround where we only have access to the array and not the domain?

I think the proposed localizeArray() routine above should still work in this case.

bradcray commented 6 years ago

Interestingly, I just ran smack into this same issue. Here's what happened essentially:

var vec: [1..n] real;
forall elem in DistArray with (in vec) {
  for v in vec do ...
}

My intuition was "the copy-in of vec will result in it being completely local to my current domain, so there shouldn't be any communication within the body of this loop", yet I was wrong because the for v in vec... loop accesses the low bound of its domain and its domain was back on locale 0. Whoops.

While this still doesn't make me think that we should copy domains in cases like this, it does make me wonder what it would take to rewrite the default iterator over vec to not require access to the domain...

bradcray commented 6 years ago

Tagging @benharsh on this just for his general interest because he touched the default rectangular array iterator most recently, and because it's another example of a case where using a vector type that carried its domain and array around with it would've helped (if I'd used it instead of an array). Tagging @ronawho because of his general interest in performance-related issues.

bradcray commented 4 weeks ago

For arrays whose domains are sufficiently const, as in the OP here, we now optimize the localization of the domain by default, resulting in an implementation similar to that of the workaround in the OP. This was implemented in #24391 and enabled by default in #25517. Closing this as a result—thanks to @LouisJenkinsCS for pointing it out (so long ago now)!

chapel-lang / chapel

Remote by-value copies of arrays use remote domain causing communication bottleneck #10888