chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.8k stars 423 forks source link

Reduce communication for distributed array fields #10160

Open bradcray opened 6 years ago

bradcray commented 6 years ago

Summary of Problem

When declaring a class object (say) with a distributed array field, it seems as though we need a way to access that distributed array field without having to communicate back to the locale owning the original object. But at present we don't do this, resulting in a lot of expensive and arguably unnecessary communication.

The reason we don't do this today is that the object lives on a single locale, so we go back to that object to refer to the field. The reason it seems we should not need to do it is that a distributed array field has a local privatized copy on every locale, so it seems as though we should be able to optimize such accesses to remote value forward the ID of the privatized object and refer to it directly without going back to the original object and field.

As a motivating example, the following future is designed to do a computation that feels as though it ought to be communication-free, yet is not.

Steps to Reproduce

Associated Future Test(s): test/distributions/privatization/commForDistArrField.chpl #10159

Configuration Information

bradcray commented 6 years ago

Brainstorming possible solutions to this issue:

LouisJenkinsCS commented 6 years ago

I'd think that the LICM optimization is a good approach, but also that it would be too reliant on what can be detected at compile-time. For example, what if the array is accessed via a class method? To copy paste your Future test here:


use BlockDist, CommDiagnostics;

config const n = 100,
             printArray = true;

class C {
  const D = {1..n} dmapped Block({1..n});
  var A: [D] real;
}

var myC = new owned C();

for loc in Locales do
  on loc {
    resetCommDiagnosticsHere();
    startCommDiagnosticsHere();
    for i in myC.D.localSubdomain() do
      myC.A.localAccess(i) = i;
    stopCommDiagnosticsHere();
    writeln(here.id, ": ", getCommDiagnosticsHere());
  }

if printArray then
  writeln("myC.A is: ", myC.A);

What happens if C has the array as private, and so the user needs to access the array via a 'getter' method that is complicated enough that the compiler may not be able to determine that the array itself should be hoisted?


use BlockDist, CommDiagnostics;

config const n = 100,
             printArray = true;

class C {
  const _D = {1..n} dmapped Block({1..n});
  var _A: [_D] real;
  proc this(idx) ref {
    // Imagine that a lot of computation is done here so it can't be handled trivially...
    return _A[idx];
  }
}

var myC = new owned C();

for loc in Locales do
  on loc {
    resetCommDiagnosticsHere();
    startCommDiagnosticsHere();
    for i in myC.D.localSubdomain() do
      myC[idx] = i;
    stopCommDiagnosticsHere();
    writeln(here.id, ": ", getCommDiagnosticsHere());
  }

if printArray then
  writeln("myC.A is: ", myC.A);

My assumption is that LICM by itself won't be sufficient in that a lot of the computation on the remote instance myC won't be elided, compared to if we allowed the user to be make well-informed choices about privatization and if we had privatized the instance itself.

cause the class itself to be privatized? (seems problematic for the non-privatized fields)

From what I can see, the only time it becomes problematic is if the user is uneducated about privatization, like if it is done implicitly without them knowing. If the user knew that privatized instances are locale-private, they would be able to work around this issue. If they desire a shared kind of data structure, then we could possibly implement a Shared class that allocates it on a single node and forwards all access to the Shared object to the object it is wrapped. For example...

class Shared {
   type objType;
   var _value : objType;
   forwarding _value;
}

var counter = Shared(atomic int(64));
coforall loc in Locales do on loc do counter.fetchAdd(1);

I would expect the above to work just fine for the user. The data is allocates on Locale 0 (or whichever node the counter gets allocated on) and is accessed in a shared fashion. If the user were able to do this, I believe it would make it apparent certain fields are shared as well (I.E it should be a standard library package if we go this route). For privatizing class automatically, I'd think it'd be cool if there was a wrapper, Distributed, that performs privatization for the user.

record Distributed {
   type objType;
   var pid = -1;
   proc init(args...?n) {
      var obj = new objType((...args));
      pid = obj.pid;
   }
   proc _value {
      if pid == -1 then halt("Uninitialized...");
      return chpl_getPrivatizedCopy(objType, pid);
   }
   forwarding _value;
}

var distList = Distributed(list(int));
coforall loc in Locales do on loc do distList.push_back(1..1024);

Just an idea.

Edit:

To elaborate on the example of the Distributed wrapper, I believe that if a data structure is meant to be used with the Distributed keyword, it should be implement some kind of interface in the end-result of eliminating most of the boiler-plate behind record-wrapping. In reality, a trait or something similar would be more appropriate but this may work in the short-term.

class DistributedList {
   // Note: Global atomics not yet supported, but imagine if it was
   var head : Shared(atomic Node);
   var tail : Shared(atomic Node);
   // Imagine that the distributed list supports aggregation/coalescing and it
   // allows the user to append to a local list before pushing globally
   var localHead : atomic Node;
   var localTail : atomic Node;
}

// Assume they implemented some kind of `push_back` or `append` method
var localList = new DistributedList(args);
var distList = new Distributed(DistributedList, args);
coforall loc in Locales do on loc {
   // Both perform these globally...
   localList.append(someValue);
   distList.append(someValue);
   // Since `distList` is privatized, this will actually be performed locally
   // Since localList isn't, it will just update the global 'localList'
   localList.appendLocal(someValue);
   distList.appendLocal(someValue);
}

In the above, the Distributed keyword performs automatic privatization, but without there will be only a single instance. Its not pretty, but it can work as a library.

bradcray commented 5 years ago

@chapel-lang/core-contributors: I think we need to really dig into this issue soon. I'd hoped my work on optimizing array slices would lead me to it faster than it has, and am probably not the best source of spare cycles. I'd be happy for anyone to work on it, though @benharsh and @mppf feel like potential candidates.

bradcray commented 5 years ago

Thinking about user-level workarounds for this issue...

Interestingly, if you capture a reference to the field in question, you can reduce the number of communications, but not eliminate them. Is this due to the reference being wide perhaps? For example, for the future above, if the loop is turned into:

    ref myCA = myC.A;
    for i in myC.D.localSubdomain() do
      myCA.localAccess(i) = i;

we go from ~500 comms per locale at --n=1000 to ~250. Better, but still bad.

@benharsh, do you have any tricks up your sleeve for knocking out the remaining comms?

benharsh commented 5 years ago

It's pretty ugly, but you can replace that ref line with this:

pragma "no copy"
var myCA = new _array(myC.A._pid, myC.A._value, _unowned=true);

And then there's only about 7 GETs per locale.

I think the 'ref' line mostly works around 'myC' being owned. If you make it unmanaged then you see the same change (500 --> 250) without the ref line.

The other source of GETs due to the localAccess line effectively turning into:

myCA._value.dsiLocalAccess(i) = i;

The "myCA" variable in your code is a reference to an _array record on another locale. When we invoke the "_value" method, we use the "_pid" field of the record to get the local privatized copy of the BlockArr class. Ideally the "_value" portion could be hoisted out of the loop. My snippet basically hoists the _value portion out of the loop manually, and uses some arcane internal stuff to get a proper "_array" type.

bradcray commented 5 years ago

Thanks for the analysis and workaround, and also "shoot": I'd hoped that a ref to a privatized array would somehow automatically result in referring to the local copy of the privatized array... :(

bradcray commented 5 years ago

FWIW, The more I've been thinking about this, the more it strikes me that the way to address it is to have RVF start looking beyond whole variables and to also consider fields that could be RVF'd, replacing their object.field accesses with a reference to the RVF'd constant. Since PID fields aren't modifiable, it seems like they'd be obvious / simple candidates for such an optimization (though it'd be nice if the same happened for other const int, const real, const string fields and the like...)

mppf commented 5 years ago

Just for the purposes of this post, I'm going to compare performance of a few different approaches that exist now. So I'll start with the current default behavior:

$ chpl test/distributions/privatization/commForDistArrField.chpl  --fast 
$ ./commForDistArrField -nl 2
0: (<no communication>)
1: (get = 103)
myC.A is: ...

This is about 2 GETs per iteration in the loop:

for i in myC.D.localSubdomain() do
      myC.A.localAccess(i) = i;

Now, I think it'd be reasonable to extend RVF to handle this case. That said, such an effort wouldn't help with the more general problem. What if, for example, there is an array of class instances, where each instance contains a distributed array field? Then an access like instances[i].distributedArrayInField[j] would still lead to communication to get the PID.

  • when the compiler finds classes with privatized fields, could it...
    • cause the class itself to be privatized? (seems problematic for the non-privatized fields)
    • split the class into privatized and non-privatized pieces?
    • create optimized field setters/getters for the privatized fields?
    • hoist such privatized fields out of the class (but then how would it access such fields?)

Could the compiler extend classes with distributed fields to be privatized? I think this is an interesting idea but I don't know how to solve the problems that come up. (That doesn't mean that they can't be solved, though). I think the basic problem is that we'd expect a variable referring to the class to contain only the (wide) class instance pointer (rather that the wide pointer and also privatization information for the fields). One response to that problem would be to propose some kind of per-locale data structure that mapped from the address of the class instance to the pid. Actually we have one of those, it is --cache-remote. And it already works for this case:

$ chpl test/distributions/privatization/commForDistArrField.chpl  --fast --cache-remote
$ ./commForDistArrField -nl 2
0: (<no communication>)
1: (get_nb = 4)
myC.A is: 

Cool, the number of GETs no longer depends on the number of loop iterations!

Naturally, you might say "But the cache will discard the pid on a fence! It doesn't need to do that!". That's true, but perhaps there is a way to extend --cache-remote so that it knows not to discard const data on a fence.

When I was first working on --cache-remote, I was talking about having a different heap / memory segment for const allocations. Supposing (as a thought experiment) that we did that, the cache could tell from an address whether or not it was in constant memory, and it could leave cached the constant data, even on a fence. There are two problems with this:

  1. The ABA problem. You might have a const class instance that is allocated, then deleted, and finally the space reused for a new object, which is also const but has different data.
  2. We don't have const instances at all today, and it'd be quite hard to split classes into two pointers (one for const fields and one for not-const fields).

So, we could contemplate using a more compiler-driven approach to 2., where the compiler would inform the --cache-remote cache which portions of memory (based on the type system) referred to constant data. The --cache-remote implementation might even use totally different data structures to cache the constant data.

But that leaves 1. still as a challenging problem. I think there's probably a reasonable way to solve it using some of the techniques @LouisJenkinsCS has been investigating, such as Quiescent State Based Reclamation (QSBR). I'd imagine such a solution might have these properties

Any such approach would also have to take care with const data on a task stack on another locale, which will be possibly reused in a different way fom the new/delete described above.

For privatizing class automatically, I'd think it'd be cool if there was a wrapper, Distributed, that performs privatization for the user.

This seems nice to have in some form, for those users that want to solve the problem directly.

we could alternatively implement LICM-style optimizations for such cases,

In fact we already have.

$ chpl test/distributions/privatization/commForDistArrField.chpl  --fast --llvm --llvm-wide-opt
$ ./commForDistArrField -nl 2
0: (<no communication>)
1: (get = 4)
myC.A is: ...

Cool, the number of GETs no longer depends on the number of loop iterations! In this case, the --llvm-wide-opt strategy allowed LLVM's LICM pass to hoist the GETs out of the loop.

bradcray commented 5 years ago

Actually we have one of those, it is --cache-remote. And it already works for this case: ... Cool...!

As noted in issue #10125, --cache-remote seems to result in hangs for CHPL_COMM=ugni and multi-locale programs.

Bummer!

ronawho commented 5 years ago

Relying on LICM (ours or llvms) seems worrisome to me because it's pretty easy to thwart LICM

Relying on remote-cache also seems worrisome because caches aren't always a win and I think we need to solve this problem consistently. I'd be curious to see how much overhead the current cache has for something like RA, where we're doing all random updates. I was going to try that, but the current implementation is hanging even for simple cases as Brad noted.

I have a relatively large code (that I can share internally) where a class has multiple distributed arrays, and there is currently an excessive amount of comm because of this issue. Cache remote hangs and llvm-wide-opt does not help for this code.

I'd be very nervous about solving this problem with a runtime cache, or with an inconsistent compiler optimization given that having distributed arrays in a class seems like it'll be a really common idiom that should just work. I liked the notion of "split the class into privatized and non-privatized pieces", but I don't understand the implementation implications there.

bradcray commented 5 years ago

I liked the notion of "split the class into privatized and non-privatized pieces", but I don't understand the implementation implications there.

I can't claim to either, but imagine that all privatized and const fields were hoisted into an outer object that was privatized, and all other fields were placed into a second, unique object that was referred to by the outer object.

mppf commented 5 years ago

I brought up this issue with it before:

What if, for example, there is an array of class instances, where each instance contains a distributed array field? Then an access like instances[i].distributedArrayInField[j] would still lead to communication to get the PID.

What if we viewed the task as changing the implementation of privatization? What if, instead of using an integerpid that is assigned once and across all locales, the privatization table mapped wide ptr -> local privatized copy? If it worked that way, then a locale, upon seeing a wide pointer that isn't privatized but could be, would access the data on the original/canonical locale (using the wide ptr) to create its own local privatized copy. Then, that local privatized copy could continue to exist until the object is deleted.

bradcray commented 5 years ago

I agree that a lazy privatization scheme would be valuable / attractive in general and keep running into wanting that in work on optimizing slices as well.

But I don't see how a different privatization scheme would help with the case in which classes have fields that aren't privatize-able / replicate-able. For example, in this class:

class C {
  var x: int;
  var y = newBlockArr(...);
  const z: int;
}

we can't really privatize / replicate x as we could for 'z' (since its constant) or the PID for (or other means of looking up) y locally since all locales should see the identical field x and any updates to it. This is what leads me to suggest splitting such classes into privatize-able fields and non-:

// compiler-generated class storing fields from C that aren't subject to replication
class C_unique {
  var x: int;
}

// original C modified to move 'x' out into a separate / non-privatize-able object
class C {
  var nonpriv = new class C_unique();
  var y = newBlockArr(...);
  const z: int;

  // setter / getter permitting `x` to be accessed as though it was a field on C
  proc x ref {
    return nonpriv.x;
  }
}
mppf commented 5 years ago

I agree that a lazy privatization scheme would be valuable / attractive in general and keep running into wanting that in work on optimizing slices as well.

👍

But I don't see how a different privatization scheme would help with the case in which classes have fields that aren't privatize-able / replicate-able.

I thought this issue wasn't about the other fields. But I think I did miss something - which is that even if we know how to map the address of the distributed array _instance to the privatized ID, we'd still communicate to the node owning the C instance to get that address. So part of the puzzle here is that the privatization has to be sortof transitive. I wonder if that might be easier to see how to do in the canonical address -> privatized instance description. In particular, perhaps the compiler could see that a field access like myC.DistributedArray is accessing data that will be const and privatized (based on the types involved) and as a result save in the table somewhere the (pointer) value of the DistributedArray field hanging off of the myC pointer. Then it could get the privatized instance from that.

In other words there would be two steps that are stored per-locale:

  1. Map from class instance + field -> field value for const fields (where the array instance for an _array is the const field of particular interest)
  2. Map from class instance -> privatized class instance.

While I think we could probably make (1) work with some compiler effort. Another approach would be to solve that part of the problem by giving users the ability to do privatization (hopefully more easily than today). However I hope we can get to a point where the simple class-with-distributed-array case works efficiently in any case, without user privatization code.

bradcray commented 5 years ago

I thought this issue wasn't about the other fields.

I think this issue focuses on the privatized fields, but doesn't insist that those are the only types of fields a class has. The first brainstorm comment I posted had some sensitivity to other fields.

by giving users the ability to do privatization (hopefully more easily than today).

I agree that this is attractive as well (see issue #10275) but (like you) think that we'd like to be able to do something smart in the case of naive users who are just composing language concepts without necessarily needing to know about privatization.

mppf commented 5 years ago

Map from class instance + field -> field value for const fields (where the array instance for an _array is the const field of particular interest)

Actually this map is even simpler than I initially imagined. That's because it doesn't need to know about fields at all. Instead, given a class instance pointer, the compiler can generate code to compute the address of the field with just an addition.

Map from class instance -> privatized class instance.

I think this is interesting as well because the whole business can be viewed as a sort of caching.

Here is how I'd imagine privatization working, in the terminology of caching. We could, but wouldn't necessarily, build this on top of --cache-remote. Since it has different rules, I'll call it a "privatization cache".

All of these privatized classes have a canonical version. I consider the privatized versions a form of "caching" remote data. This privatization cache would need to be optimized for const or rarely changing data. (Even const variable/field data can change - consider a const field in a class instance. That class can be deallocated and then another one can be allocated into the same memory. This is the ABA problem.). In the context of privatization for arrays and similar, the data is even less likely to be marked const in the source code but we want to optimize for the case in which it changes rarely. For example, for a Block distributed array, I'd expect the relevant rarely changing data to at least include:

Since we expect uses of the Block array will be much more common than operations changing those elements, we want to "privatize" the Block array fields. In a way, we might view that as wanting to use a more aggressive caching strategy than --cache-remote (and in particular doing it regardless of things like fences).

What I think is particularly interesting about this is that it can sidestep many of the issues with the --cache-remote style of caching. Why?

So, what would the pattern of operations be, for example with a Block array?

So how does this fit into a class like this?

class C {
  var x: int;
  var y = newBlockArr(...);
  const z: int;
}
mppf commented 5 years ago

I didn't really respond to one of @bradcray's earlier ideas and wanted to:

But I don't see how a different privatization scheme would help with the case in which classes have fields that aren't privatize-able / replicate-able. ... This is what leads me to suggest splitting such classes into privatize-able fields and non-:

I like the way that this strategy keeps the size of a pointer to C fixed (because I'd be worried about bundling arbitrary amounts of "privatization data" along with pointers to a C). However it adds indirection when accessing x that might present performance challenges.

Additionally I have been wondering if the current privatization strategy is overly complex to actually use (with Chapel code responsible for copying various bits of data around to different instances managed in Chapel code).

I think that the privatization cache I outlined also handles privatized and non-privatized fields elegantly. (The compiler knows, based upon the type/field being accessed, that the values of non-privatized fields aren't going to be found in the privatization cache, and so accesses them through normal means). I also think it would be much easier to use (it would become more about informing the compiler which fields should be privatized than implementing the privatization in the module code).

One issue with the compiler knowledge of which accesses to privatize is that the knowledge can be lost if a const ref is taken to a const field and then the const ref is passed to another locale. (Since the type of that might just be e.g. "wide pointer to int" at that point). My hypothesis here is that this case is actually already reasonably well handled by RVF.

benharsh commented 5 years ago

I'm currently favoring the caching approach over some kind of class-splitting.

I think a first step in the caching direction could be to cache the _array records instead of trying to cache arbitrary const fields or privatizing lazily. Michael mentioned offline that our invalidation mechanism could rely on the pre-existing _freePrivatizedClass, which seems straightforward enough (but maybe has performance issues?).

Between the two, the caching approach also seems like it would be easier to prototype.

Sorry if I've missed something in these long comments, but is one of these approaches more capable than another, or are we mainly talking about implementation details?

mppf commented 5 years ago

I've been thinking about this issue. I think combining it with some of the ideas in #8483 would be interesting. I'd be interested in giving it a try.

npadmana commented 4 years ago

@bradcray -- this has been niggling me for a few weeks/months now because my code uses this idiom pretty extensively, but I was unable to see any real slow down in the code (no matter how hard I tried looking).

It turns out that the following modification to your code (which is effectively what I had been doing in my code) takes you to no communication at all (edited to fix a bug, thanks @bradcray ).

use BlockDist, CommDiagnostics;

config const n = 100,
             printArray = false;

class C {
  const D = {1..n} dmapped Block({1..n});
  var A: [D] real;
}

var myC = new owned C();

ref A = myC.A;
const Dom = A.domain;

for loc in Locales do
  on loc {
    resetCommDiagnosticsHere();
    startCommDiagnosticsHere();
    const myDom = Dom.localSubdomain();
    for i in myDom do
      A.localAccess(i) = i;
    //Note that writing the loop as this also works well!
    //for i in A.localSubdomain() do
    //  A.localAccess(i) = i;
    stopCommDiagnosticsHere();
    writeln(here.id, ": ", getCommDiagnosticsHere());
  }

if printArray then
  writeln("myC.A is: ", myC.A);

Compiling and running

$ chpl --fast test.chpl 
$ ./test -nl 2 --n=1000
0: (<no communication>)
1: (<no communication>)
$ ./test -nl 4 --n=1000
0: (<no communication>)
1: (<no communication>)
2: (<no communication>)
3: (<no communication>)
$ chpl --version
chpl version 1.22.0
Copyright 2020 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)

Apologies if you already know of this... but I couldn't find a user level fix, and this seemed pretty easy.

bradcray commented 4 years ago

@npadmana: In your rewrite, aren't you doing n iterations per locale when you'd want to be doing ~n/numLocales? And then when compiling without --fast, you should be getting OOB errors for those indices in D that are not owned by a given locale. Is there a chance you do something different in your real code than your simple example?

npadmana commented 4 years ago

@bradcray -- good catch. I just updated the above example -- it should have been iterating over the subdomain. Note that this works also when I directly just iterate over the subdomain of A as in the commented region.

bradcray commented 4 years ago

Thanks for the workaround, Nikhil. It's involved enough that, for the typical user, it just makes me want to solve the root issue even more. But it's good to know that it exists.

As a side note, the const Dom = A.domain; line will not be super-cheap (it creates a new distributed domain, so will talk to all locales in doing so), though recent optimization work from @e-kayrakli in Chapel 1.22 should make it less expensive than it used to be. Making it into const ref declaration instead seems to incur 3 gets within the section being measured (regardless of n) and may be worth it to avoid the creation of Dom depending on the use case.

e-kayrakli commented 4 years ago

const ref should be better for sure. We are in the order of hundreds of GETs with 4 locales for creating a distributed domain.

Here are distributed domain creation communication count numbers:

https://chapel-lang.org/perf/chapcs.comm-counts/?startdate=2019/08/18&enddate=2020/07/06&graphs=creatingdistributeddomains