[Multiresolution Design Philosophy] Expose user-friendly chpl_comm primitives

LouisJenkinsCS commented 5 years ago

As part of multiresolution design philosophy, I believe that the user should have access to low-level primitives, especially to something that is considered relatively stable, such as to the communication layer (and honestly even the tasking layer, but that is a separate issue). In particular, the user should be allowed to explicitly call PUT and GET, and there should be a user-facing API for this.

module CommPrimitives {
   proc commGet(ref obj : ?t) : t;
   proc commPut(ref dest : ?t, src : t) : void;
   proc commGetNB(ref obj : ?t) : Future(t);
   proc commPutNB(ref dest : ?t, src : t) : Future(void);
}

Here, Future can wrap the chpl_comm_nb_handle_t objects. I believe this can be rather valuable for squeezing out the raw performance that comes from having a priori knowledge of the algorithm. Some overloads, say up to 8 parameters of various types, would also be appreciated (or even a variadic version, if that is possible).

Currently, the thing stopping me from creating this myself is getting the int32_t typeIndex argument, which seems to be generated by the compiler and does not seem to be exposed via compiler primitives. Hence if the assistance of the compiler is needed, it would be nice to have whatever minimal changes to the compiler are necessary to make this happen.

mppf commented 5 years ago

I agree that this would be useful. I've also been wondering if we should expose something that looks like an existing interface, like OpenSHMEM, but I suspect that might make the problem harder than it needs to be.

The typeIndex argument should be computed by the compiler based on the type of one of the arguments to the primitive. For example, see https://github.com/chapel-lang/chapel/blob/b61fb6ecda7cbe1e41898eeb7f91ee2827cc38b3/modules/standard/GMP.chpl#L1145-L1148 . I think it'd be possible to implement the API you sketched using __primitive today.

LouisJenkinsCS commented 5 years ago

Cool! Is there a non-blocking primitive?

mppf commented 5 years ago

No, not yet. I've (personally) been hoping to migrate these primitives to use extern calls to the runtime. In that event we would add a primitive to compute the type ID you mentioned.

ronawho commented 5 years ago

> especially to something that is considered relatively stable, such as to the communication layer

The comm layer and runtime API don't change that often, but they aren't stable and I don't want to make them stable since that could inhibit future optimizations. I'm on board with exposing a low-level set of communication primitives, but I don't want to set any expectations that the runtime API itself is stable or user-facing. Maybe we could expose a version or something to help with compatibility, but even that adds a decent amount of maintenance overhead.

In terms of your user-facing API -- I wonder if we really want to expose get/put or if a unified copy routine would be cleaner. As an example think of a case like:

var a = 1;
on Locales[1] {
  var b = 2;
  on Locales[2] {
    // how to express a=b using primitives?
    // (put/get require that one side be local)

    // very broken, b doesn't live on locale 2
    commPut(a, b); 

    // correct (what the compiler would insert for a=b), but ugly
    var tmp = commGet(b);
    commPut(a, tmp);

    // internally, have this routine do get, put, or getput
    commCopy(a, b);
  }
}
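To make the "get, put, or getput" decision concrete, here is a minimal sketch of how a unified copy routine might classify a transfer from the point of view of the executing locale. The names CopyKind and classifyCopy are hypothetical and exist only for illustration; the sketch relies on the built-in .locale query and here, and it leaves the actual data movement to ordinary assignment. Locales are chosen relative to numLocales so the program also runs on a single locale (where everything is a local copy).

```chapel
// Hypothetical names for illustration; not part of any Chapel module.
enum CopyKind { localCopy, get, put, getput }

// Classify the transfer needed for "dst = src" as seen from the locale
// that executes this call.
proc classifyCopy(const ref dst, const ref src): CopyKind {
  const dstLocal = dst.locale == here;
  const srcLocal = src.locale == here;
  if dstLocal && srcLocal then return CopyKind.localCopy;
  if dstLocal then return CopyKind.get;    // only src is remote
  if srcLocal then return CopyKind.put;    // only dst is remote
  return CopyKind.getput;                  // both sides are remote
}

var a = 1;
on Locales[(here.id + 1) % numLocales] {
  var b = 2;
  on Locales[(here.id + 1) % numLocales] {
    // With three or more locales, neither a nor b lives here.
    writeln("a = b needs: ", classifyCopy(a, b));
    a = b;  // plain assignment; the compiler inserts the communication
  }
}
```

A real commCopy would presumably branch on this classification to call the matching runtime routine; that part is omitted here because the runtime API is not stable.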

LouisJenkinsCS commented 5 years ago

proc commPut(ref dest : ?t, src : t) : void would have to obtain the value of b first, as it is passed by value. These are low-level primitives, but if you want some kind of error-proofing, you could implement commPut to take the second argument by ref and use a local block or something when getting the value.

With regards to commCopy, so long as a non-blocking version of this exists, I'm fine with that. However, one significant thing I would like is to perform non-blocking puts and gets to multiple locations. If commCopy returns a chpl_comm_nb_handle_t (or a wrapper for it), I'm fine with that since I can then collect a series of requests and construct my own future.

var x : int;
var y : real;
var z : [1..100] int;
on Locales[1] {
   // commCopyNB specific overload for arrays/slices
   var handles = (commCopyNB(x, 5), commCopyNB(y, 3.14), commCopyNB(z[1..100], 1..100));
   async(
      lambda(handles : 3 * nbHandleType) { 
         for handle in handles do commWait(handle); 
      }, handles
   );
}

ronawho commented 5 years ago

Something interesting to think about -- do you require non-blocking operations to be ordered, or is it OK if they break the memory consistency model? e.g.

var a = 0;
on Locales[1] {
  commPutNB(a, 1);
  commPutNB(a, 2);
  commWait(...)
  // what are the legal values of a?  1? 2? either?
}

If you care about ordering, the runtime has to do a lot more work to ensure these PUTs occur in order especially on networks with adaptive routing.

If you don't care about ordering or memory consistency -- maybe you want something like unordered operations such as UnorderedCopy. On a Cray XC we map unordered operations to chained transactions (we buffer operations and only talk to the NIC once per buffer instead of once per operation). This can be ~2.5x faster than plain non-blocking ops, but the benefits really depend on the message size and the underlying network.

I don't think I have a specific point here, just some general musings:

- as soon as you move away from blocking/ordered operations you get into some fun areas of memory consistency
- if a user is allowed to specify their intent instead of the exact implementation, the runtime can optimize based on the hardware/network (i.e. do you really want to force the runtime to do a non-blocking put if it could do something faster with a little more information)
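As a rough illustration of "specifying intent", the sketch below gathers an array element by element with unordered copies and then fences once. It assumes the UnorderedCopy package module provides unorderedCopy and unorderedCopyTaskFence; the fence name in particular is recalled from the package documentation and should be checked against the installed Chapel version. The array names and the size n are made up for the example.

```chapel
use UnorderedCopy;

config const n = 8;   // hypothetical problem size

var src: [0..#n] int = [i in 0..#n] i * i;
var dst: [0..#n] int;

on Locales[numLocales-1] {
  // Each copy may complete in any order; the runtime is free to buffer
  // or batch them however suits the network.
  for i in 0..#n do
    unorderedCopy(dst[i], src[i]);

  // Assumed fence call: wait until this task's pending copies are done.
  unorderedCopyTaskFence();
}

assert(&& reduce (dst == src));
```

Nothing in the loop says "non-blocking PUT" or "GET"; the caller only states that ordering does not matter until the fence, which leaves the mapping to the runtime.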

LouisJenkinsCS commented 5 years ago

> do you require non-blocking operations to be ordered, or is it OK if they break the memory consistency model?

Yes, it is okay for it to be unordered because it's non-blocking. If I'm doing a non-blocking put, then I'm fully aware of the consequences and welcome the benefits. Emphasis on low-level: let me shoot myself in the foot, because if there wasn't a standard way to do it, I'd find some less safe way to do it anyway.

Second, I really don't like the specific targeting of Cray machines. If under the hood, commCopy does this on uGNI, great, but if I want to run it on InfiniBand and over UDP (where non-blocking communication can really benefit me), I want the same code to work.

As to whether this interferes with the compiler and runtime, it comes down to "let me shoot myself in the foot". There certainly are times where I know better about how communication should be in my program than the compiler, and I plan to use these non-blocking constructs when such cases arise, not before profiling.

ronawho commented 5 years ago

There is a distinction between non-blocking and unordered. In UPC for example, non-blocking operations from the same thread are ordered.

UnorderedCopy on ibv and other networks will soon map down to non-blocking comm or whatever is fastest for that particular network. All I meant was that non-blocking ops aren't always the fastest you can get, so if you can, leave the names vague so the runtime can do whatever is best for the hardware.

My intention was to convey some of the problems and corner cases we have run into while working on similar issues. If you just want to build what you have described in the issue, then use primitives and externs to call the runtime functions, and note that since int32_t is a fixed-size integer you just use int(32) in Chapel. The runtime API will change over time, so don't expect it to be stable.

LouisJenkinsCS commented 5 years ago

> There is a distinction between non-blocking and unordered. In UPC for example, non-blocking operations from the same thread are ordered.

I see, I suppose that makes sense. I'm assuming that in the communication layer, non-blocking communications get dispatched in batch so that both commPutNB(a, 1); commPutNB(a, 2); can occur in whatever order the communication layer wants. In that case, I'd suspect that if you desired some kind of ordering, it would be nice to maintain some illusion of sequential consistency.

> UnorderedCopy on ibv and other networks will soon map down to non-blocking comm or whatever is fastest for that particular network.

I see, I see. UnorderedCopy does seem rather promising. My apologies for dismissing the suggestion. Although right now it only works on numeric types, hm? I suppose that would suffice for most applications, and is satisfactory for my short-term needs as well.

> If you just want to build what you have described in the issue, then use primitives and externs to call the runtime functions, and note that since int32_t is a fixed-size integer you just use int(32) in Chapel.

I still need to obtain the typeIndex from a user-specified type. I was thinking of experimenting with these constants I found in the compiler:

https://github.com/chapel-lang/chapel/blob/986025d1bfdbc7d0281cd893597ac13e1bdd61bd/third-party/llvm/llvm-src/include/llvm/DebugInfo/CodeView/TypeIndex.h#L27-L91

Maybe that can help for determining what 32-bit constant I need to represent a type.

LouisJenkinsCS commented 5 years ago

Also quick question @ronawho does UnorderedCopy allow specifying a wide-reference as the dest argument? As in, can we use it to perform unordered PUTs as well?

ronawho commented 5 years ago

In terms of correctness/functionality -- UnorderedCopy supports arbitrary wideness (src/dst both local, src remote, dst remote, src/dst remote).

In terms of performance -- Today UnorderedCopy only optimizes GETs (the src remote case) and it only supports numeric types and is only optimized on Crays. In the near term we will optimize PUTs and GETPUT as well, extend the optimization to any POD type, and extend the optimization to non-ugni comm layers.

My expectation is that for non-ugni comm layers we will map UnorderedCopy to non-blocking transactions. For POD types larger than ~32 bytes we'll probably also use non-blocking transactions under ugni too.

Unordered operations can fully saturate the network injection rate for small messages on a Cray. It took a while to figure out how to get that performance in a mechanism that is at least somewhat usable for advanced users. Recent work has been focused on simplifying the runtime implementation and moving stuff into common code so it's easier to flesh out the API and extend to other comm layers.

ronawho commented 5 years ago

And to be clear -- I think there is still value in exposing other low level comm primitives. UnorderedCopy may not be exactly what you need in all cases. I brought it up because:

- the prototype name for unorderedCopy was getBuff -- this was a bad name because it only worked for literal GETs (dst had to be local) and it specified too much about implementation details. With a vaguer name, we have a lot more flexibility in terms of how we map that down to the network.
- the MCM (memory consistency model) implications get hard really fast. unorderedCopy completely breaks the MCM. Non-blocking ops on their own will break the MCM in the same way without extra work from the runtime to maintain ordering of operations issued from the same task.

ronawho commented 5 years ago

> I'm assuming that in the communication layer, non-blocking communications get dispatched in batch so that both commPutNB(a, 1); commPutNB(a, 2); can occur in whatever order the communication layer wants. In that case, I'd suspect that if you desired some kind of ordering, it would be nice to maintain some illusion of sequential consistency.

I could be wrong, but I don't think you can batch up non-blocking ops easily (unordered can and does, but I think non-blocking has to send immediately, or you need a background thread to occasionally flush buffers). For example, with non-blocking ops I think the following program is legal, but it would deadlock if you buffered. You either need a background thread to flush buffers, or need seq_cst atomics to flush buffers, or something similar.

var a, b: atomic int;
begin on Locales[1] { a.addNonBlocking(1); b.waitFor(1); }
a.waitFor(1);
b.add(1);
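For comparison, here is a sketch of the same handshake written with today's blocking atomic operations, which should always terminate. The only changes are a blocking a.add(1) in place of the hypothetical addNonBlocking, a sync block so the program waits for the spawned task, and Locales[numLocales-1] so it also runs on a single locale.

```chapel
var a, b: atomic int;

sync {
  begin on Locales[numLocales-1] {
    a.add(1);       // blocking: the update is complete before we wait on b
    b.waitFor(1);   // released by the main task's b.add(1) below
  }
  a.waitFor(1);     // returns once the remote task's add has landed
  b.add(1);
}

writeln("handshake completed");
```

If a.add(1) were instead buffered and never flushed, the remote task would sit in b.waitFor(1) with its add still pending while the main task sat in a.waitFor(1), so neither side could make progress. That is the deadlock described above.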

mppf commented 5 years ago

+1 for adding some kind of copy and using unorderedCopy for the non-blocking operations

chapel-lang/chapel, issue #13052