build-trust / ockam

Orchestrate end-to-end encryption, cryptographic identities, mutual authentication, and authorization policies between distributed applications – at massive scale.
https://ockam.io
Apache License 2.0

Possible deadlock under high load, due to fixed-capacity channels? #1947

Open thomcc opened 2 years ago

thomcc commented 2 years ago

https://github.com/ockam-network/ockam/blob/8bf1502c8a105c874e8d493097c6c57948183a09/implementations/rust/ockam/ockam_node/src/context.rs#L129 (and some others) use a fixed-capacity channel (which, when full, blocks the sender until space frees up).

This is surprising to me, since there's a pretty classic deadlock that fixed-capacity channels can cause in use cases like ours. Or at least it seems like it could happen; there's enough going on in ockam_node that I can't tell whether this case is already handled.


Specifically, the situation where I worry a deadlock happens is:

  1. You have two workers, W1 and W2, which may message each other at least some of the time. (I think they also need to be local workers, but maybe I'm just lacking imagination.)

  2. Both W1 and W2 are very nearly at their channels' capacity (for example, because they can't process messages fast enough and the backlog has filled all the way up).

  3. Now, W1 sends a message to W2. Because W2's channel is full, the send(msg).await won't resolve until W2 pops the next item off its channel and W1's message finds space.

  4. Before W2 gets to do that (while W1 is still blocked), W2 sends W1 a message.

  5. Sadly, W1's channel is also at capacity, so W2's send(msg).await won't resolve until W1 pops the next item off its channel.

That is, W1 is blocked until W2 pops an item off its queue... and W2 is blocked until W1 pops an item off its queue.
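
To make the cycle concrete, here's a minimal standalone sketch of that scenario using plain tokio bounded channels. This is not Ockam's worker machinery; the capacity, task names, and message type are made up purely for illustration:

```rust
// Requires tokio with the "full" feature set (or at least rt, macros, sync).
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Tiny capacity so the "backlog is full" precondition is easy to hit.
    let (to_w1, mut w1_rx) = mpsc::channel::<u32>(1);
    let (to_w2, mut w2_rx) = mpsc::channel::<u32>(1);

    // Step 2: both mailboxes are already at capacity.
    to_w1.send(0).await.unwrap();
    to_w2.send(0).await.unwrap();

    let to_w2_from_w1 = to_w2.clone();
    let w1 = tokio::spawn(async move {
        // Step 3: W1 sends to W2 before draining its own mailbox.
        to_w2_from_w1.send(1).await.unwrap(); // blocks: W2's channel is full
        w1_rx.recv().await // never reached
    });

    let to_w1_from_w2 = to_w1.clone();
    let w2 = tokio::spawn(async move {
        // Step 4: W2 does the same in the other direction.
        to_w1_from_w2.send(1).await.unwrap(); // blocks: W1's channel is full
        w2_rx.recv().await // never reached
    });

    // Steps 3-5: neither send can complete, so neither worker reaches its
    // recv(); this join never finishes and the program hangs.
    let _ = tokio::join!(w1, w2);
    println!("unreachable while the deadlock holds");
}
```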


That said, I don't know for sure this is actually possible. Message sending in ockam is involved enough that it might not be, and might just look possible from the outside. (Concretely, I'm assuming ctx.send_message eventually bottoms out in something like a channel::send on these channels.)

In any case, I have seen this type of deadlock before in other projects, so even though it sounds unlikely, it definitely happens under high load (especially if you aren't running a full release build).

younes-io commented 2 years ago

@thomcc : What about a queuing system? Or maybe semaphores (with the number of permits equal to the worker's capacity) could allow some sort of synchronization between workers. P.S.: I'm new here and don't fully grasp the Ockam network yet, still discovering :)

thomcc commented 2 years ago

I think the fix, if this is actually an issue, is: when sending a message to an actor whose channel is full, return an error if that actor is itself blocked waiting on a message send.

Something like that, anyway. That said, it will be a bit tricky to restructure things so that this is possible. I'm also not 100% sure this isn't already compensated for somewhere in the system (possibly even using a scheme similar to what I just described)... @spacekookie might know?
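
For illustration, a rough sketch of the shape that fail-fast check could take, built on tokio's try_send. None of these types exist in Ockam, and the peer_is_blocked_on_send flag is a hypothetical stand-in for whatever shared worker state would actually be needed:

```rust
use tokio::sync::mpsc::{error::TrySendError, Sender};

#[derive(Debug)]
enum SendOutcomeError {
    WouldDeadlock,
    Closed,
}

/// Try to enqueue `msg` for a worker. If the worker's mailbox is full *and*
/// that worker is itself currently blocked on an outbound send, fail instead
/// of waiting, which breaks the cycle described in this issue.
async fn send_or_bail<T>(
    tx: &Sender<T>,
    peer_is_blocked_on_send: bool, // would come from shared worker state
    msg: T,
) -> Result<(), SendOutcomeError> {
    match tx.try_send(msg) {
        Ok(()) => Ok(()),
        Err(TrySendError::Closed(_)) => Err(SendOutcomeError::Closed),
        Err(TrySendError::Full(msg)) => {
            if peer_is_blocked_on_send {
                // Waiting here could complete the deadlock cycle, so give up.
                Err(SendOutcomeError::WouldDeadlock)
            } else {
                // Otherwise it's safe to wait for capacity as today.
                tx.send(msg).await.map_err(|_| SendOutcomeError::Closed)
            }
        }
    }
}
```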

younes-io commented 2 years ago

@thomcc : do you think I can find the full (up-to-date) specification of this workflow / protocol in the documentation?

thomcc commented 2 years ago

Sorry, I'm not aware of anything that documents the internals of ockam_node. I do think that they're being simplified somewhat in #2007, if it's any consolation.

younes-io commented 2 years ago

Thank you @thomcc. If the protocol were documented, I could write a TLA+ specification for it to make sure there is no deadlock.

mrinalwadhwa commented 2 years ago

This talk by @spacekookie is a good high-level overview of ockam_node internals: https://youtu.be/d4Mk0TK6rYA?t=417

younes-io commented 2 years ago

Thank you @mrinalwadhwa :) I will look into it!

younes-io commented 2 years ago

@thomcc : Could you please point me to an example project with two workers communicating with each other as you describe in your post? I couldn't find such an example in the repo. Thank you!

thomcc commented 2 years ago

Hmm, I don't think we have an example of it on hand. It was more of a thought experiment.