Open LouisJenkinsCS opened 5 years ago
I agree that this would be useful. I've also been wondering if we should expose something that looks like an existing interface, like OpenSHMEM, but I suspect that might make the problem harder than it needs to be.
The typeIndex argument should be computed by the compiler based on the type of one of the arguments to the primitive. For example, see https://github.com/chapel-lang/chapel/blob/b61fb6ecda7cbe1e41898eeb7f91ee2827cc38b3/modules/standard/GMP.chpl#L1145-L1148 . I think it'd be possible to implement the API you sketched using __primitive today.
Cool! Is there a non-blocking primitive?
No, not yet. I've (personally) been hoping to migrate these primitives to use extern calls to the runtime. In that event we would add a primitive to compute the type ID you mentioned.
> especially to something that is considered relatively stable, such as to the communication layer
The comm layer and runtime API don't change that often, but they aren't stable and I don't want to make them stable since that could inhibit future optimizations. I'm on board with exposing a low-level set of communication primitives, but I don't want to set any expectations that the runtime API itself is stable or user-facing. Maybe we could expose a version or something to help with compatibility, but even that adds a decent amount of maintenance overhead.
In terms of your user-facing API -- I wonder if we really want to expose get/put or if a unified copy routine would be cleaner. As an example, think of a case like:
var a = 1;
on Locales[1] {
  var b = 2;
  on Locales[2] {
    // how to express a=b using primitives?
    // (put/get require that one side be local)

    // very broken, b doesn't live on locale 2
    commPut(a, b);

    // correct (what the compiler would insert for a=b), but ugly
    var tmp = commGet(b);
    commPut(a, tmp);

    // internally, have this routine do get, put, or getput
    commCopy(a, b);
  }
}
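The dispatch a unified copy routine would need can be sketched in Chapel itself. This is a minimal sketch, not the runtime's implementation: `CopyKind`, `classifyCopy`, and this `commCopy` are hypothetical names, and the body falls back to plain assignment (letting the compiler insert the communication) rather than calling any runtime routine. It uses `Locales[numLocales-1]` so that it also runs on a single locale.

```chapel
// Hypothetical names: CopyKind, classifyCopy, and commCopy are illustrations,
// not part of Chapel's runtime or standard modules.
enum CopyKind { sameLocale, put, get, getPut }

// Decide which transfer is needed from where each argument lives,
// relative to the locale the calling task is running on.
proc classifyCopy(const ref dest, const ref src): CopyKind {
  const destHere = dest.locale == here;
  const srcHere = src.locale == here;
  if destHere && srcHere then return CopyKind.sameLocale;
  else if srcHere then return CopyKind.put;   // local source, remote dest
  else if destHere then return CopyKind.get;  // remote source, local dest
  else return CopyKind.getPut;                // both sides remote
}

proc commCopy(ref dest: ?t, const ref src: t) {
  // A real implementation would branch on classifyCopy(dest, src) and call
  // the matching runtime routine. Assumption: plain assignment stands in
  // for that here, so the compiler generates whatever transfer is needed.
  dest = src;
}

var a = 1;
on Locales[numLocales-1] {
  var b = 2;
  writeln(classifyCopy(a, b));  // depends on how many locales are in use
  commCopy(a, b);
}
writeln(a);
```

The point of the classification step is that the caller never has to know which side is local; only the routine does.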
proc commPut(ref dest : ?t, src : t) : void would have to obtain the value of b first, as it is passed by value. These are low-level primitives, but if you want some kind of error-proofing, you could implement commPut to take the second argument by ref and use a local block or something when getting the value.
With regards to commCopy, so long as a non-blocking version of this exists, I'm fine with that. However, one significant thing I would like is to be able to perform non-blocking puts and gets to multiple locations. If commCopy returns a chpl_comm_nb_handle_t (or a wrapper for it), I'm fine with that, since I can then collect a series of requests and construct my own future.
var x : int;
var y : real;
var z : [1..100] int;
on Locales[1] {
  // commCopyNB specific overload for arrays/slices
  var handles = (commCopyNB(x, 5), commCopyNB(y, 3.14), commCopyNB(z[1..100], 1..100));
  async(
    lambda(handles : 3 * nbHandleType) {
      for handle in handles do commWait(handle);
    }, handles
  );
}
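For comparison, the same "issue several remote writes, then wait for all of them" pattern can be approximated today without any handle type, by swapping in task parallelism for non-blocking handles. This is a sketch under that assumption: each write still blocks its own task, so it is not equivalent to true non-blocking communication, and `Locales[numLocales-1]` is used only so that it runs on a single locale as well.

```chapel
var x: int;
var y: real;
var z: [1..100] int;

on Locales[numLocales-1] {
  // Each statement of the cobegin runs as its own task, so the three
  // remote writes can be in flight at the same time. The cobegin returns
  // only after all of its tasks finish, which plays the role of commWait.
  cobegin with (ref x, ref y, ref z) {
    x = 5;
    y = 3.14;
    for i in z.domain do z[i] = i;
  }
}
```

The cost is one task per transfer, which is exactly the overhead that a handle-based API would avoid.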
Something interesting to think about -- do you require non-blocking operations to be ordered, or is it OK if they break the memory consistency model? E.g.:
var a = 0;
on Locales[1] {
  commPutNB(a, 1);
  commPutNB(a, 2);
  commWait(...)
  // what are the legal values of a? 1? 2? either?
}
If you care about ordering, the runtime has to do a lot more work to ensure these PUTs occur in order, especially on networks with adaptive routing.
If you don't care about ordering or memory consistency -- maybe you want something like unordered operations such as UnorderedCopy. On a Cray-XC we map unordered operations to chained transactions (we buffer operations and only talk to the NIC once per buffer instead of once per operation). This can be ~2.5x faster than just non-blocking ops, but the benefits really depend on the message size and the underlying network.
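As a rough illustration of what using unordered operations looks like, here is a minimal sketch with the UnorderedCopy package module. It assumes the `unorderedCopy(dst, src)` call as described in that module's documentation; the arrays and the size `n` are made up for the example, and on a single locale the call simply behaves like assignment.

```chapel
use UnorderedCopy;

config const n = 8;
var src: [0..#n] int;
var dst: [0..#n] int;
for i in 0..#n do src[i] = i * i;

// Each iteration requests a copy without constraining the order in which
// the copies complete relative to one another. Per the module's
// documentation, results are expected to be visible once the issuing task
// terminates (or after an explicit fence), so dst is read only after the loop.
forall i in 0..#n with (ref dst) do
  unorderedCopy(dst[i], src[i]);

writeln(dst);
```

Because the name says nothing about how the copy is carried out, the runtime stays free to buffer, chain, or issue non-blocking transfers as the network allows.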
I don't think I have a specific point here, just some general musings:
> do you require non-blocking operations to be ordered, or is it OK if they break the memory consistency model
Yes, it is okay for it to be unordered because it's non-blocking. If I'm doing a non-blocking put, then I'm fully aware of the consequences and welcome the benefits. Emphasis on low-level: let me shoot myself in the foot, because if there wasn't a standard way to do it, I'd find some less safe way to do it anyway.
Second, I really don't like the specific targeting of Cray machines. If under the hood commCopy does this on uGNI, great, but if I want to run it on InfiniBand or over UDP (where non-blocking communication can really benefit me), I want the same code to work.
As to whether this interferes with the compiler and runtime, it comes down to "let me shoot myself in the foot". There certainly are times where I know better about how communication should be in my program than the compiler, and I plan to use these non-blocking constructs when such cases arise, not before profiling.
There is a distinction between non-blocking and unordered. In UPC for example, non-blocking operations from the same thread are ordered.
UnorderedCopy on ibv and other networks will soon map down to non-blocking comm or whatever is fastest for that particular network. All I meant was that non-blocking ops aren't always the fastest you can get, so if you can, leave the names vague so the runtime can do whatever is best for the hardware.
My intention was to convey some of the problems and corner cases we have run into while working on similar issues. If you just want to build what you have described in the issue, then use primitives and externs to call the runtime functions. Note that since int32_t is a fixed-size integer, you can just use int(32) in Chapel. The runtime API will change over time, so don't expect it to be stable.
> There is a distinction between non-blocking and unordered. In UPC for example, non-blocking operations from the same thread are ordered.
I see, I suppose that makes sense. I'm assuming that in the communication layer, non-blocking communications get dispatched in batch so that both commPutNB(a, 1); commPutNB(a, 2); can occur in whatever order the communication layer wants. In that case, I'd suspect that if you desired some kind of ordering, it would be nice to maintain some illusion of sequential consistency.
> UnorderedCopy on ibv and other networks will soon map down to non-blocking comm or whatever is fastest for that particular network.
I see, I see. UnorderedCopy does seem rather promising. My apologies for dismissing the suggestion. Although right now it only works on numeric types, hm? I suppose that would suffice for most applications, and is satisfactory for my short-term needs as well.
> If you just want to build what you have described in the issue, then use primitives and externs to call the runtime functions. Note that since int32_t is a fixed-size integer, you can just use int(32) in Chapel.
I still need to obtain the typeIndex from a user-specified type. I was thinking of experimenting with these constants I found in the compiler. Maybe that can help for determining what 32-bit constant I need to represent a type.
Also, quick question @ronawho: does UnorderedCopy allow specifying a wide-reference as the dest argument? As in, can we use it to perform unordered PUTs as well?
In terms of correctness/functionality -- UnorderedCopy supports arbitrary wideness (src/dst both local, src remote, dst remote, src/dst remote).
In terms of performance -- Today UnorderedCopy only optimizes GETs (the src remote case) and it only supports numeric types and is only optimized on Crays. In the near term we will optimize PUTs and GETPUT as well, extend the optimization to any POD type, and extend the optimization to non-ugni comm layers.
My expectation is that for non-ugni comm layers we will map UnorderedCopy to non-blocking transactions. For POD types larger than ~32 bytes we'll probably also use non-blocking transactions under ugni too.
Unordered operations can fully saturate the network injection rate for small messages on a Cray. It took a while to figure out how to get that performance in a mechanism that is at least somewhat usable for advanced users. Recent work has been focused on simplifying the runtime implementation and moving stuff into common code so it's easier to flesh out the API and extend to other comm layers.
And to be clear -- I think there is still value in exposing other low level comm primitives. UnorderedCopy may not be exactly what you need in all cases. I brought it up because:
> I'm assuming that in the communication layer, non-blocking communications get dispatched in batch so that both commPutNB(a, 1); commPutNB(a, 2); can occur in whatever order the communication layer wants. In that case, I'd suspect that if you desired some kind of ordering, it would be nice to maintain some illusion of sequential consistency.
I could be wrong, but I don't think you can batch up non-blocking ops easily (unordered can and does, but I think non-blocking has to send immediately). For example, I think the following program is legal for non-blocking ops, but it would deadlock if you buffered. You would need either a background thread that occasionally flushes buffers, or seq_cst atomics that flush buffers, or something similar.
var a, b: atomic int;
begin on Locales[1] { a.addNonBlocking(1); b.waitFor(1); }
a.waitFor(1);
b.add(1);
+1 for adding some kind of copy and using unorderedCopy for the non-blocking operations
As part of multiresolution design philosophy, I believe that the user should have access to low-level primitives, especially to something that is considered relatively stable, such as to the communication layer (and honestly even the tasking layer, but that is a separate issue). In particular, the user should be allowed to explicitly call PUT and GET, and there should be a user-facing API for this.
Where Future can wrap the chpl_comm_nb_handle_t objects. I believe this can be rather valuable for siphoning raw performance that comes from having a priori knowledge of the algorithm. As well, some overloads, say up to 8 parameters of various types, would be appreciated (or even variadic if that is possible).
Currently, the thing stopping me from creating this myself is getting the int32_t typeIndex argument, which seems to be generated by the compiler and does not seem to be exposed via compiler primitives. Hence if the assistance of the compiler is needed, it would be nice to have whatever minimal changes to the compiler are necessary to make this happen.