chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.78k stars 420 forks source link

Remote-declared arrays and `forall` locality #25351

Open e-kayrakli opened 3 months ago

e-kayrakli commented 3 months ago

2.1 has the initial implementation for remote variable declarations. One of the limitations of this initial implementation comes from not the feature itself but readiness in the module code. Take the following example:

on Locales[1] var Arr: [1..1] int;
forall a in Arr {
  writeln(here);
}

I'd expect LOCALE1 to be output but this gives LOCALE0. This is most likely about introducing an on this or equivalent in default rectangular array implementation. Whether that could impact performance, I don't know, but it'd be something to check for sure.

Note that implementing this should also change the behavior of:

var Arr: [1..1] int;
on Locales[1] {
  forall a in Arr do
    writeln(here);  // says LOCALE1, expect LOCALE0
}
bradcray commented 3 months ago

I'd expect LOCALE1 to be output but this gives LOCALE0.

I'm feeling somewhat conflicted about this suggestion. On one hand, I can imagine an argument like "When we iterate over a block-distributed domain/array, we implement the loop by creating tasks on the places that we think are most natural for that iteration, so shouldn't we do that for DR arrays/domains as well?

But on the other hand, what if the body of the second loop accesses tons of data on locale 1 such that the current behavior is what I wanted; in that case, the only recourse I have is to put an on-clause within the body of the forall loop, but that's going to be very expensive. Or, I suppose I could do something like rewrite it to:

on Locales[1] {
  const ArrDom = Arr.domain;
  forall i in ArrDom do
    writeln(here);  // and substituting `A[i]` for `a` in any loop that referred to the elements from the original
}

but that seems pretty onerous.

It also feels like a big behavioral change for 2.0. Granted, we're discussing an anti-pattern (i.e., maybe Arr should've just been declared within the on-clause?), but…

I imagine the motivation for this is so that if Locales[1] was reallyhere.gpus[0]in the initial example then the user would get an automatic kernel launch from theirforallloop, but should they? On one hand, it'd be convenient if that's what the user wanted; but on the other hand, that feels like whaton here.gpus[0]` is for and it seems like it could increase confusion about which code runs on the CPU or GPU when a loop's header and/or body refers to a mix of CPU and GPU variables.

This feels fairly analogous to when people loop over a range and then use it to access a distributed array and are surprised that the loop completely executes on locale 0. But I can't decide whether the current model or proposed one would be better or less surprising.

One tool we could use to get this behavior as a non-breaking change would be a "localizing layout" that had the behavior proposed here (or a similar param argument to the DR layout)—either of which would permit opting into this behavior without changing the default.

Alternatively, we could have DRs declared using the new "remote variable" mechanism (and/or those targeting GPUs) to have this behavior without changing ones that aren't, as in your second example.

Whether that could impact performance, I don't know

Based on how I think we implement on-clauses today, I believe it would add to the generated code size and compile time but not hurt the performance (due to fast-path optimizations for already-local on-clauses implemented using code clones).

vasslitvinov commented 3 months ago

I would consider changing the execution locale of forall a in Arr more seriously if it were not a 2.0-breaking change. In that case, I would suggest changing the locale of forall i in Dom for the same reason. Given stability constraints, I suggest we deprioritize this.

With GPU-allocated arrays this is different because we do not have stability constraints. We can say that GPU-allocated arrays are not Default Rectangular at least in some aspects. So, by analogy with Block arrays, we could justify that a forall co-locates with a GPU array.

I am on the fence whether a GPU array should be "like Block" or "like Default Rectangular" in this regard. What makes me think of the latter is that wrapping the forall within on Arr is an easy thing to add, once the user knows they need to do it.

e-kayrakli commented 3 months ago

I'll start by adding some more cases to keep in mind:

on Locales[1] var Arr: [1..n] int;
A += 1;

I feel strongly that we don't want communication for the increment, but we do get fine-grained comm today. One can argue that we can handle promotion differently, but then we break the symmetry between promotion and foralls, which would be a step in the wrong direction in my view. Further complicating matters:

  on Locales[1] var Arr1: [1..1000] int;
  on Locales[2] var Arr2: [1..1000] int;
  on Locales[3] var Arr3: [1..1000] int;
  Arr1 = Arr2 + Arr3;

where do you want the promotion to execute? Locales[1] is the most natural place for me by far because this is analogous to forall .... in zip(Arr1, Arr2, Arr3) where the locality is dictated by the leader.


@bradcray

Granted, we're discussing an anti-pattern (i.e., maybe Arr should've just been declared within the on-clause?), but…

I don't see this as an anti-pattern. At least thinking about your parenthetical significantly reduces the power of remote variable declarations in my mind.

But on the other hand, what if the body of the second loop accesses tons of data on locale 1 such that the current behavior is what I wanted; in that case, the only recourse I have is to put an on-clause within the body of the forall loop, but that's going to be very expensive. Or, I suppose I could do something like rewrite it to:

on Locales[1] {
  const ArrDom = Arr.domain;
  forall i in ArrDom do
    writeln(here);  // and substituting `A[i]` for `a` in any loop that referred to the elements from the original
}

but that seems pretty onerous.

Yes, I think this should be the way to make things local today. While I agree that adding that one line is unfortunate, it is not much different than localizing remote arrays and operating on them. It is pretty symmetrical in fact. Instead of localizing the data, we are localizing the execution.

I do think there is a room for improvement in the interface, though. on intents is one thing that comes to mind (https://github.com/chapel-lang/chapel/issues/18585):

on Locales[1] with (in ArrDom) { // or put your line in here if ArrDom doesn't exist
  forall i in ArrDom do
    writeln(here);  // and substituting `A[i]` for `a` in any loop that referred to the elements from the original
}

could be a way, for example. Or we can consider creating copy methods on arrays and domains which could be useful in other contexts, too:

on Locales[1] {
  forall i in Arr.domain.copy() do
    writeln(here);  // and substituting `A[i]` for `a` in any loop that referred to the elements from the original
}

This feels fairly analogous to when people loop over a range and then use it to access a distributed array and are surprised that the loop completely executes on locale 0. But I can't decide whether the current model or proposed one would be better or less surprising.

This is a good analogy. In my mental model, the reason for that behavior is because the range is declared on locale 0 (by default). If you were to declare it on some other locale (by opting in via remote variable declarations), the loop would execute on that particular locale. I think this is a fairly consistent and less surprising view of locality with forall loops.

One tool we could use to get this behavior as a non-breaking change would be a "localizing layout" that had the behavior proposed here (or a similar param argument to the DR layout)—either of which would permit opting into this behavior without changing the default.

Alternatively, we could have DRs declared using the new "remote variable" mechanism (and/or those targeting GPUs) to have this behavior without changing ones that aren't, as in your second example.

These feels inconsistent and less productive than my suggestion. Remote-declared arrays and others should behave consistently. Using a separate layout seems unfortunate as I view remote variable declarations to be the right tool for that job already. If we were to consider using layouts for GPU-based arrays, I'd be more open to doing that without remote variable declarations:

var GpuDomain = {1..10} dmapped new DeviceArr(); // or "localizing layout" for more general cases
var GpuArr: [GpuDomain] int;

seems good enough for me. Throwing in a remote variable declaration for the purpose feels redundant. (While remote variable declaration instead of layout is still my preference for this pattern)


@vasslitvinov

With GPU-allocated arrays this is different because we do not have stability constraints. We can say that GPU-allocated arrays are not Default Rectangular at least in some aspects. So, by analogy with Block arrays, we could justify that a forall co-locates with a GPU array.

I prefer not to make GPU arrays different than other arrays. Note also that when we have distributed array support, one can literally use a block distribution for this purpose as well.

var GpuArr = blockDist.createArray(1..10, int, targetLocales=[here.gpus[0],]);

forall elem in GpuArr do // this is a kernel
                         // (it should also be the case if this was a
                         //  remote-declared array)

General concern about language stability: I understand the concern here, but I was thinking (could be the wrong starting point on my part) that this is a bug. We don't have many patterns where this is exposed, so, we ended up with the behavior that we have today. Remote variable declarations will expose this issue a bit more clearly, so we will fix the issue.

I can also argue that this is beyond the stability guarantees we have. The change here will not break existing code; it will make the common case (or what I perceive to be the common case) go faster.

mppf commented 3 months ago

With the caveat that I've only skimmed the above discussion:

I share the concern about it being a potentially breaking change. Perhaps this is a use case for language editions?

IMO the reason this is so tricky is when considering generic code working with arrays. If I have to add the on clause like this:

on MyArray {
  forall elt in MyArray { ... }
}

that becomes potentially confusing and potentially slower, and at least useless, when MyArray is distributed. (Since forall elt in MyArray already runs distributed in MyArray is distributed). I think it's particularly confusing because with on MyDistributedArray, the distributed array isn't in just 1 locale.

Also, I recall at some point being surprised about something related... I think it was that a for elt in MyBlockDistributedArray runs on a single locale. I thought I made an issue about that but can't find it.

bradcray commented 3 months ago

For my part, I found Engin's responses yesterday reasonably convincing, but haven't had the time to reply to specific points about it yet.

Michael, I'm not really understanding your "if I have to add the on clause…" bit. Are you saying that if we preserve the current behavior and you as a user would have to add it? And is the implication that that makes you in favor of following the proposal here (with the caveats/concerns about breaking behavior)?

Also, I recall at some point being surprised about something related... I think it was that a for elt in MyBlockDistributedArray runs on a single locale.

There's definitely been a longstanding question here about making this follow the location of the elements. I think the main reason we haven't is that it's somewhat difficult to code up cleanly/efficiently without doing an on-clause per element (for a multidimensional block-distributed array, for example). I've never viewed this as something that's "locked in" by the current language definition or implementation, though I also suspect we never talked about it explicitly w.r.t. Chapel 2.0 and stability.

mppf commented 3 months ago

Are you saying that if we preserve the current behavior and you as a user would have to add it? And is the implication that that makes you in favor of following the proposal here (with the caveats/concerns about breaking behavior)?

Yes. Put another way, I was saying it would be unsatisfying to need to add on MyArray around a forall elt in MyArray when writing generic code that should handle distributed or local arrays or GPU arrays.

e-kayrakli commented 3 months ago

Adding one more case:

on Locales[1] var Arr: [1..10];
var sum = + reduce Arr;  // where do we do this reduction?
                         // Note that this lowers into a forall

To be clear, sum should be on Locales[0]. The result will need to be copied across the network. But the reduction should be local to the array.

vasslitvinov commented 3 months ago

I am still concerned about backwards compatibility and subtle performance issues popping in existing code if we change this. Another thought:

on Locales[1] var Arr: [1..10];
var sum = + reduce Arr;  // where do we do this reduction?

If I wanted to run the reduction on Locales[1], I would wrap that entire snippet in on Locales[1], not just the array declaration.

bradcray commented 3 months ago

If I wanted to run the reduction on Locales[1], I would wrap that entire snippet in on Locales[1], not just the array declaration.

Not if you wanted sum to live on locale 0.

I am still concerned about backwards compatibility

I think that's a reasonable concern and if we had a language board, I'd definitely want them to sign off on this. But also, I would also ask: Where do we specify how forall loops over arrays (of any kind) are implemented today, and do you think we are unable to make any changes to how they're implemented at present that isn't clearly buggy?

subtle performance issues popping in existing code

That seems like something we can screen for in key codes. It would also be easy to add an execution-time warning to call out cases that this behavior change would affect. In fact, it'd be interesting to implement that and see what multilocale tests break in our test system (i.e., where are we relying on the current behavior, and which of those cases are ones that would benefit from this change vs. be hurt by it?)

bradcray commented 3 months ago

Catching up on Engin's main response to my previous comment:

Granted, we're discussing an anti-pattern (i.e., maybe Arr should've just been declared within the on-clause?), but…

I don't see this as an anti-pattern.

Sorry, I was probably unclear—in saying that, I was still talking specifically about the second case from the OP:

what if the body of the second loop

where it seems as though the code would be much more natural/reasonable if 'Arr' had been declared within the on-clause:

on Locales[1] {
  var Arr: [1..1] int;
  forall a in Arr do
    writeln(here);  // says LOCALE1, expect LOCALE0
}

E.g., in a code review, I don't think we'd ever encourage anyone writing it the first way to keep it that way.

on intents is one thing that comes to mind

:+1:

This is a good analogy. In my mental model, the reason for that behavior is because the range is declared on locale 0 (by default). If you were to declare it on some other locale (by opting in via remote variable declarations), the loop would execute on that particular locale.

I don't think that's the case today, is it? Or maybe you're saying that under this proposal, you think it should be? Exploring that incrementally: If we were to follow this proposal, it seems to me that—without optimization—domains should follow suit. That is, if I had code:

var D = {1..n};
on Locales[1] {
  var A: [D] real;
  forall i in D ...
    A[i] = …;
}

Under this issue's proposal, it seems like the loop should execute on locale 0 since, by analogy, if D was a block-distributed array it would move the iterations to the corresponding locales, so the local domain case should presumably do the same as well, similar to the local array case. But the range case is much less clear to me:

var r = 1..n};
on Locales[1] {
  var A: [r] real;
  forall i in r do
    A[i] = ...;
}

Traditionally, I've thought of 'r' as being more like a scalar value than a significant, locality-oriented object, so it seems more surprising to me that this would result in a location change (and it definitely doesn't today). But it would also arguably be weird to make the two codes above act differently. And maybe I've just been living in the current world for too long that it's taking me longer to have my mind adjust to this case given that there isn't a distributed analogy for it.

Unforunately (or fortunately?), we don't have many other built-in iterable types in the language to consider in weighing this question. Though I suppose standard libraries like 'List' would probably want to follow suit (but also don't today). Somehow, the standard libraries seem less worrisome to me, I suppose since they are simply not baked into the language.

I was thinking that this is a bug.

My initial reaction to that was mild discomfort, because I think we went into the current definition very intentionally and with our eyes open (or, at least, I did), so it's not as though it's behaving differently than expected/intended. But on reflection, I could see an argument that it was an oversight/inconsistency, especially since I don't think we've documented the expected behavior anywhere (although one might argue that in the lack of the documentation, the on-clause would win…).

e-kayrakli commented 3 months ago

This is a good analogy. In my mental model, the reason for that behavior is because the range is declared on locale 0 (by default). If you were to declare it on some other locale (by opting in via remote variable declarations), the loop would execute on that particular locale.

I don't think that's the case today, is it? Or maybe you're saying that under this proposal, you think it should be? Exploring that incrementally: If we were to follow this proposal, it seems to me that—without optimization—domains should follow suit.

Yes, yes. I was explaining that the reason for iterating over a range while accessing a distributed array causing communication is because the range makes sure that the execution is happening on the locale it lives in (not literally what's happening today, but that's how I was making sense of the behavior). So, to me, it follows very naturally that your examples with the range and the domain should also move execution.

I must admit that the range case is unclear to me, as well. While in your case:

var r = 1..n};
on Locales[1] {
  var A: [r] real;
  forall i in r do
    A[i] = ...;
}

I might be upset with the user that they should have just used A.domain as it is way more idiomatic, there could be cases like:

var r = 1..n;
var inner = r.expand(-1);
on Locales [1] {
  var A: [r] real;
  forall i in inner {
    A[i] = ...;
  }
}

which is a bit more realistic.

That being said, shouldn't inner be RVFed under any circumstance there? It looks like it doesn't get RVF'ed, potentially because its iterator was being used? In any case, I think it should be which makes me feel a bit conflicted:

Philosophically, this last bit of conflicted feelings also makes me think that such locality matters should never be stabilized. Different optimizations could alter locality in ways that could be visible to the user. Different compilers/runtimes of the language could also make different decisions.

bradcray commented 3 months ago

shouldn't inner be RVFed under any circumstance there?

I agree that it potentially should be, but it may be that we only RVF sufficiently simple types and ones flagged as RVFable? Anyway, I was avoiding worrying too much about RVF in any of these scenarios for now since I view it as largely orthogonal to what we're discussing here, which is how should the iterators be implemented w.r.t. locality (?). E.g., the domains and arrays could potentially also be rvf'd in a future world where they were sufficiently const and compact?