chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Remote processor atomics can lead to subtle deadlock #12888

Open ronawho opened 5 years ago

ronawho commented 5 years ago

Simple programs that directly or indirectly wait on the value of an atomic can deadlock. The following program will very likely deadlock when network atomics are not available:

var a: atomic int;
cobegin {
  coforall 1..here.maxTaskPar do while a.read() != 1 { } // ok with `chpl_task_yield()`
  on Locales[numLocales-1] do a.fetchAdd(1); // ok for a.add(1);
}
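
For comparison, here's a sketch of a variant that doesn't hang: waitFor() re-reads the atomic but yields to other tasks between checks, so the on-stmt that implements the remote fetchAdd back on this locale can get scheduled.

var a: atomic int;
cobegin {
  // waitFor() yields between reads, so the task implementing the
  // remote fetchAdd's on-stmt back to this locale can run
  coforall 1..here.maxTaskPar do a.waitFor(1);
  on Locales[numLocales-1] do a.fetchAdd(1);
}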

This is because our default tasking layer uses cooperative scheduling, and we implement remote atomics with on-stmts + processor atomics when network atomics aren't available. Something roughly like:

proc AtomicInt.fetchAdd(val: int) {
  var ret: int;
  on this.locale do ret = processor_atomic_fetch_add(this.value, val);
  return ret;
}

We allow non-fetching atomics to fire the fast-on optimization (run the operation directly in the progress thread instead of creating tasks), but fetching atomics will have comm to return the fetched value, so we don't currently allow those as fast-ons.
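
In the same rough pseudocode, the non-fetching case has no result to return to the initiator, so the on-stmt body is comm-free and can be a fast-on (processor_atomic_add is a placeholder in the same spirit as processor_atomic_fetch_add above):

proc AtomicInt.add(val: int) {
  // no fetched value to send back, so this body can run directly in
  // the progress thread instead of spawning a task
  on this.locale do processor_atomic_add(this.value, val);
}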

This can create subtle deadlocks that I think we can/should avoid. A simple solution would be to allow fetching atomics to run in the progress thread, but I'm not sure that's a great idea because we really don't want to do any blocking operations in the progress thread. That said, this is only a problem for configurations without network atomics, so maybe the performance isn't as important in those configurations?

ronawho commented 5 years ago

In the trivial code example above it's easy to see the source of the deadlock, but it's also easy to run into this in cases where the source of the problem is not obvious at all.

bradcray commented 5 years ago

Thanks for the reminder as to why we can't run this in the progress thread. I think at times we've talked about having a "service thread" or the like to handle important tasks that we don't want to schedule alongside typical user tasks. Do you view that as a potential solution here?

this is only a problem for configurations without network atomics, so maybe the performance isn't as important in those configurations?

Though I often throw non-Cray networks under the bus, I'm not sure we should in this case. Also, can't blocking the progress thread cause bigger problems than bad performance, such as causing GASNet to start complaining to the console that messages are backing up, or the like?

ronawho commented 5 years ago

Thanks for the reminder as to why we can't run this in the progress thread. I think at times we've talked about having a "service thread" or the like to handle important tasks that we don't want to schedule alongside typical user tasks. Do you view that as a potential solution here?

I think we actually could run them in the progress thread, but we don't today. The progress thread cannot run operations that could block, and we try to avoid "long-running" operations. Since we don't know the size of the comm at compile time, we just avoid fast-ons for any on-stmt body that contains comm. I think we could special-case fetching atomics, though.

this is only a problem for configurations without network atomics, so maybe the performance isn't as important in those configurations?

Though I often throw non-Cray networks under the bus, I'm not sure we should in this case.

I agree this is important for non-Crays. In this case I meant CHPL_NETWORK_ATOMICS, not "hardware support for network atomics." For example, ofi has network atomics even for udp and will not have this problem. GASNet-EX has also added support for network atomics.

Also, can't blocking the progress thread cause bigger problems than bad performance, such as for GASNet to start complaining to the console that messages are backing up or the like?

This isn't really "blocking" the progress thread; it would just slow it down. For the progress thread, blocking would be "we can't make progress" vs. "this operation is slow". But yeah, this overall tension is why we don't currently implement fetching atomics in the progress thread.

mppf commented 5 years ago

GASNet-EX also added support for network atomics.

Does that mean that upgrading our use of GASNet calls would solve this problem with the UDP conduit or with InfiniBand, say? In other words, does GASNet-EX have some kind of equivalent of this on-statement for networks that don't have hardware support for network atomics? In that event, I think we should simply start using these in GASNet configurations and call it a day.

This isn't really "blocking" the progress thread, it would just slow it down.

We could always break up the fetching atomic operation into a series of nonblocking operations that are scheduled on the same progress thread. This might be warranted in this case. That way, the progress thread would not actually "block" while waiting for the PUT for the fetchAdd to complete (say); rather, it would initiate the PUT, enqueue the remaining work as an action, and then work through enqueued actions periodically (hopefully when mostly idle).
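
A rough sketch of the shape of that on the atomic's locale, in the same pseudocode spirit as above (put_nb and enqueueAction are hypothetical names here, not existing runtime calls):

// running in the progress thread on the atomic's locale (sketch only)
ret = processor_atomic_fetch_add(obj.value, val);   // fast, never blocks
handle = put_nb(initiatorLocale, resultAddr, ret);  // start sending the result back; don't wait
enqueueAction(handle);                              // re-check for completion later, when mostly idle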

gbtitus commented 5 years ago

I would argue that allowing a fetching atomic to run in an AM progress thread is a greater danger in a configuration without network atomics, because there the only way to do "network" atomics is via AM, so we'd be initiating an AM from an AM handler and could livelock. Maybe we could use an NB AM instead of a regular one, but I think that just moves the lack-of-progress problem: if the remote AM handler for whatever reason is prevented from acting on the NB AM, the fetched value will never arrive back at the initiator and the task that did it will stall. It actually seems fairly safe to do a fetching atomic op from an AM handler in a configuration where network atomics are present (and the specific AMO in question doesn't need an AM, i.e. the target memory is registered when registration is required, etc.), although one would want to do it in an NB way, that is, fire it and then go ahead and handle other AMs while waiting for the response.

mppf commented 5 years ago

so we'd be initiating an AM from an AM handler and could livelock. Maybe we could use an NB AM instead of a regular one, but I think that just moves the lack-of-progress problem: if the remote AM handler for whatever reason is prevented from acting on the NB AM, the fetched value will never arrive back at the initiator and the task that did it will stall.

I might be missing something but I think that as long as the "progress" thread doing the fast-on statements doesn't block, then it's fine for another task to block. My point was that remote atomics could be implemented entirely as nonblocking operations running in the progress threads (i.e. fast-on). As long as these are written in a way that doesn't stop the progress threads from servicing more requests, isn't it just as good as the hardware-supported atomics case?

gbtitus commented 5 years ago

My point was that remote atomics could be implemented entirely as nonblocking operations running in the progress threads (i.e. fast-on). ...

Yes, as long as those ops are done in a non-blocking way and the overhead of progressing through the steps you outlined isn't objectionable (and I doubt it will be), then functionally it'll work.