chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

ZMQ Poll #13003

Open LouisJenkinsCS opened 5 years ago

LouisJenkinsCS commented 5 years ago

I'm wondering if it's possible to implement ZMQ Poll in the ZMQ module. It would be nice to be able to have, say, one socket per task per locale in our program, and then poll for results. This could be useful in computations that need a reactive flow, such as N-Body/Nearest Neighbor computations. Let's say you have something like...

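// Sketch only: poll(), neighbors(), computeFn(), and dataType are placeholders, not existing APIs.
use ZMQ;

var perTaskPerLocaleDim = 1..numLocales * here.maxTaskPar;
// One row of sockets for each (locale, task) pair.
var commMatrix : [perTaskPerLocaleDim] [perTaskPerLocaleDim] Socket;
coforall loc in Locales do on loc {
   coforall tid in 1..here.maxTaskPar {
      const commIdx = here.id * here.maxTaskPar + tid;
      // Proposed: poll() reports which sockets in this row have data ready.
      var ready : [perTaskPerLocaleDim] bool = poll(commMatrix[commIdx]);
      for (avail, socket) in zip(ready, commMatrix[commIdx]) {
         if avail {
            var data = socket.recv(dataType);
            // Forward the computed result to every neighbor of this task.
            for neighbor in neighbors(here.id, tid) {
               neighbor.send(computeFn(data));
            }
         }
      }
   }
}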

In my opinion, this looks rather elegant! What would hold me up from writing such nice-looking code in Chapel right now?

Edit: Ignore locality for now; in my planned abstraction and in practice, this can easily be handled for the user!

lydia-duncan commented 5 years ago

I don't have any objections to this. We just didn't take the time to implement it when the module was started. I suspect we'll want to implement the ZMQ Error hierarchy mentioned in #12397 (which I had accidentally closed, whoops) to make this easier (or at least more stable).

LouisJenkinsCS commented 5 years ago

Understood! Although I'm not sure I like the idea that poll should throw a ZMQ Error if it times out here, because we actually want to get the sockets that are available. Maybe it should throw an error if one of the socket endpoints disconnects, but I'm not sure whether that should instead be returned as an error code for that particular socket (i.e., instead of returning an array of bool, return an array of int representing error codes).
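The "array of status codes instead of array of bool" idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: `PollStatus` and `fakePoll` are hypothetical names, nothing here exists in the ZMQ module, and the poll is simulated from two input arrays so that the calling pattern can be shown without a ZMQ dependency.

```chapel
// Hypothetical per-socket poll result; not part of the ZMQ module.
enum PollStatus { ready, timedOut, failed }

// Stand-in for a real poll: derives a status for each "socket" from two
// input arrays instead of talking to ZMQ.
proc fakePoll(hasData: [] bool, broken: [] bool) {
  var result: [hasData.domain] PollStatus;
  for i in hasData.domain {
    if broken[i] {
      result[i] = PollStatus.failed;
    } else if hasData[i] {
      result[i] = PollStatus.ready;
    } else {
      result[i] = PollStatus.timedOut;
    }
  }
  return result;
}

const hasData = [true, false, false];
const broken  = [false, false, true];
const statuses = fakePoll(hasData, broken);

// The caller can tell the three cases apart without any thrown error.
for (i, s) in zip(statuses.domain, statuses) {
  select s {
    when PollStatus.ready do writeln("socket ", i, ": data ready");
    when PollStatus.timedOut do writeln("socket ", i, ": timed out");
    when PollStatus.failed do writeln("socket ", i, ": failed");
  }
}
```

An enum is used here rather than raw int codes only to keep the sketch readable; an array of int would carry the same information.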

lydia-duncan commented 5 years ago

I'm not seeing an error for timeouts here?

LouisJenkinsCS commented 5 years ago

"Upon failure, zmq_poll() shall return -1 and set errno to one of the values defined below."

and

"ETERM: At least one of the members of the items array refers to a socket whose associated ØMQ context was terminated."

lydia-duncan commented 5 years ago

Ah, okay! Could also just return an array of Errors instead of throwing them?

LouisJenkinsCS commented 5 years ago

Isn't Error a class right now? It seems a bit heavyweight to allocate an array of N errors. For example, if all N sockets time out, would we have a ZMQTimeoutError for each one? If not, how would we differentiate between a socket timing out and a disconnect? What if the Error were instead held inside the Socket itself, so that if an error is associated with a Socket, the user can do something like Socket.getError?

lydia-duncan commented 5 years ago

Oh, but returning multiple errors means we wouldn't be able to rely on the C implementation for this function, or at least we would have to implement repetition for the sockets that didn't fail but haven't been polled yet...

LouisJenkinsCS commented 5 years ago

I think we can. If we detect via 'ETERM' that a socket failed, we can see which ones (plural) failed, flag them as failing, remove them from the set of sockets we're polling over, and retry. This way we can return which sockets have an actual error, which timed out, and which have data ready to be read.
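A minimal sketch of that retry loop, under loud assumptions: `FakeSocket`, `pollOnce`, `pollWithRetry`, and `PollStatus` are all hypothetical names, and `pollOnce` only imitates the all-or-nothing failure of zmq_poll() (returning false where the C call would return -1 with ETERM). A real version would wrap the C call and would need some way to probe individual sockets for termination.

```chapel
// Hypothetical per-socket outcome; not part of the ZMQ module.
enum PollStatus { ready, timedOut, failed }

// Simulated socket state standing in for a real ZMQ socket.
record FakeSocket {
  var hasData: bool;
  var terminated: bool;
}

// Imitates zmq_poll(): fails as a whole (returns false) if any polled
// socket is terminated; otherwise records which active sockets have data.
proc pollOnce(socks: [] FakeSocket, active: [] bool, ref ready: [] bool): bool {
  for i in socks.domain do
    if active[i] && socks[i].terminated then return false;
  for i in socks.domain do
    ready[i] = active[i] && socks[i].hasData;
  return true;
}

proc pollWithRetry(socks: [] FakeSocket) {
  var status: [socks.domain] PollStatus = PollStatus.timedOut;
  var active: [socks.domain] bool = true;
  var ready: [socks.domain] bool;
  while !pollOnce(socks, active, ready) {
    // Find every failed socket, flag it, drop it from the poll set, retry.
    for i in socks.domain {
      if active[i] && socks[i].terminated {
        status[i] = PollStatus.failed;
        active[i] = false;
      }
    }
  }
  // Sockets that were neither failed nor ready keep the timedOut status.
  for i in socks.domain do
    if ready[i] then status[i] = PollStatus.ready;
  return status;
}

const socks = [new FakeSocket(true, false),
               new FakeSocket(false, true),
               new FakeSocket(false, false)];
const result = pollWithRetry(socks);
writeln(result);
```

Because every terminated socket is removed in a single pass, this loop retries at most once in the simulation; against real sockets a context could terminate between retries, so more iterations are possible.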