htex command channel use is not thread safe

benclifford commented 4 months ago

Describe the bug

The interchange command channel is presented on the submit side as an object, executor.command_client, that can be invoked from any thread. However, it uses a ZMQ channel, which is not thread safe. When two threads hit this too close together in time, this causes weird behavior in ZMQ: at least sometimes a segfault-style crash, and I suspect possibly also some hangs.

The command channel accesses should be made thread safe. I don't have a favoured solution. Some ideas that I am not super fond of but I guess would be OK: i) make a new ZMQ connection (in the opposite TCP direction to the current command channel) for every thread that makes a command client invocation; ii) use an internal cross-thread RPC mechanism and gateway those RPCs in the thread that owns the ZMQ connection.
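As a rough illustration of idea ii), here is a minimal sketch using only the standard library. `CommandGateway`, `make_channel` and `make_fake_channel` are hypothetical names invented for this example and are not part of Parsl; the fake channel stands in for a real ZMQ socket. The point is only that the channel object is created and used in exactly one thread, while any other thread submits commands through a queue and waits on a future.

```python
import queue
import threading
from concurrent.futures import Future


class CommandGateway:
    """Hypothetical gateway: one owner thread performs every channel operation."""

    def __init__(self, make_channel):
        # make_channel is a hypothetical factory. It is called inside the
        # owner thread, so the channel never crosses a thread boundary.
        self._make_channel = make_channel
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        channel = self._make_channel()  # created and used only in this thread
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            command, future = item
            try:
                # One request, then its reply, before the next request starts.
                future.set_result(channel(command))
            except Exception as exc:
                future.set_exception(exc)

    def run(self, command, timeout=None):
        """Callable from any thread; blocks until the owner thread replies."""
        future = Future()
        self._requests.put((command, future))
        return future.result(timeout=timeout)

    def close(self):
        self._requests.put(None)
        self._thread.join()


def make_fake_channel():
    """Stand-in for a real socket: checks it is only used from one thread."""
    owner = threading.get_ident()

    def channel(command):
        assert threading.get_ident() == owner, "channel used from wrong thread"
        return f"reply to {command}"

    return channel
```

A caller in any thread would then use something like `gateway.run("some command", timeout=5)`. Because the queue is drained by a single thread, commands are also strictly sequenced, so no second request is issued before the previous reply has been received.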

In the desc branch of parsl, I've put a lot of locks around ZMQ command channel accesses, but I don't believe this is sufficient to guarantee thread safety given ZMQ's discussion of e.g. memory barriers.
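For reference, the lock-based approach looks roughly like the sketch below. `LockedCommandClient`, `send` and `recv` are hypothetical names for this example, not the actual desc branch code. It holds one lock across the whole request/reply round trip, which keeps requests from interleaving, but it still lets different threads touch the same underlying channel, which is the part I doubt is sufficient for ZMQ.

```python
import threading


class LockedCommandClient:
    """Hypothetical wrapper: serialises each send/receive pair under one lock."""

    def __init__(self, send, recv):
        # send and recv are hypothetical callables standing in for the
        # operations on the underlying command channel.
        self._send = send
        self._recv = recv
        self._lock = threading.Lock()

    def run(self, command):
        # The lock covers the request AND its reply, so a second request can
        # never be sent while the first is still waiting for its response.
        with self._lock:
            self._send(command)
            return self._recv()
```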

To Reproduce

It's a race condition.

Expected behavior

This thread unsafety should not happen.

benclifford commented 3 months ago

Another observation I made, that I think is relevant to this issue:

The interchange command client contains this:
    except zmq.ZMQError:
        logger.exception("Potential ZMQ REQ-REP deadlock caught")
        logger.info("Trying to reestablish context")
        self.zmq_context.recreate()
        self.create_socket_and_bind()
The only reason I can figure out for this is that the command client is used in a non-thread-safe manner and that multiple commands can be invoked from multiple threads without receiving responses - which is non-thread-safe in two ways: 1) you can't send two REQs on the same ZMQ socket in a row, without waiting for the first REP; and 2) you can't use the same ZMQ socket across multiple threads even if you sequence your operations properly.
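Point 1) can be seen in isolation with a small pyzmq sketch. This is a standalone demonstration, not Parsl code: the function name and the `inproc://command-demo` endpoint are made up for the example. It sends two requests on one REQ socket without reading a reply in between and checks that the second send is rejected with `EFSM` ("operation cannot be accomplished in current state").

```python
import zmq


def second_send_fails() -> bool:
    """Return True if a second REQ send, issued before any REP, raises EFSM."""
    ctx = zmq.Context()
    rep = ctx.socket(zmq.REP)
    req = ctx.socket(zmq.REQ)
    # Do not block on close waiting for undelivered messages.
    rep.setsockopt(zmq.LINGER, 0)
    req.setsockopt(zmq.LINGER, 0)
    try:
        rep.bind("inproc://command-demo")     # hypothetical endpoint name
        req.connect("inproc://command-demo")
        req.send(b"first command")            # allowed: REQ starts in send state
        try:
            req.send(b"second command")       # no recv() yet, so out of sequence
        except zmq.ZMQError as e:
            return e.errno == zmq.EFSM
        return False
    finally:
        req.close()
        rep.close()
        ctx.term()
```

This only covers the sequencing rule within a single thread. Point 2), sharing the socket across threads, is a separate problem and is not something a small script can demonstrate reliably, since it is a race.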

#3376 talks about non-threadsafe use of the command client, and I guess I hope that if #3376 is fixed, the above can go away.

benclifford commented 3 months ago

Further to that last comment, I'm unclear in which situations that re-establish code will work. I think it makes the following assumptions:

i) the previous TCP-level listening socket has been closed due to this ZMQError, because otherwise the bind in this except block will fail;

ii) the interchange-side command ZMQ socket is in a "waiting for REQ" state, so that the rebind is compatible with the above protocol flow. What happens, for example, if the rebind happens with a reply waiting to be sent back to the command client? (Does ZMQ reconnect and then attempt to send the REP into the reconnected socket that is not expecting a REP?)
