Open nash8114 opened 2 months ago
I've pushed a new commit. The pushed changes should help fix an issue upgrading to this new version.
The implementation now supports both tuple size 3 and 4 for :proxy_operation. When :proxy_message_ttl is set to :infinity (the default), proxy_to_node
will use tuple size 3. This means nodes running the new version and the old version can still proxy messages to each other. If during the upgrade :proxy_message_ttl
would be set to an integer value, the upgrade will cause :proxy_operation messeages from upgraded nodes to old nodes to fail. The CHANGELOG would need to reflect this as an upgrade risk.
Yes, please do add an entry to the changelog for this.
The approach you are taking in this PR is to send an error when TTL goes to 0.
But could we also just start the process when TTL is 0?
What do you think?
That sounds like a great idea.
I've avoided it so far as it means that proxy_to_node
would need to be able to make a distinction between different messages. Currently it is completely message agnostic. The only two message types it currently handles are :start_child
and :terminate_child
. Performing :start_child
locally when TTL runs out makes a lot of sense. The same is not true for terminate_child
.
So if you're OK with proxy_to_node being message specific, I can build in an exception for :start_child
. But how to handle an expired TTL for :terminate_child
? This should not occur of course, except when the TTL is set to 0 or there is some corruption in the state (where two nodes disagree on who owns a process)
Pushed an update which changes the start_child
logic to perform add_child
locally if the TTL has expired. Tests have been updated accordingly.
On a another note, I am considering returning an error when starting a Horde.DynamicSupervisor with a :proxy_message_ttl set to zero, as obviously no one should ever set it to zero, and it would cause issues for terminate_child
.
There is also still the issue that :proxy_message_ttl defaults to :infinity, which allows for upgrading seamlessly from a previous version to this version. But at what point would the :proxy_message_ttl default to something else? Or should the README indicate that setting the proxy TTL (after upgrade) is recommended?
terminate_child
is not likely to loop, as the node on which a process is running is known. If it does loop for some odd reason, the message would expire.
graph TD;
pa[process A] -- call start_child() --> A[node A];
A -- proxy start_child(ttl: 2) --> B[node B];
B -- proxy start_child(ttl: 1) --> C[node C];
C -- TTL expired, perform add_child and reply {:ok, pid} --> pa;
pb[process B] -- call terminate_child() --> Ab[node A];
Ab -- proxy terminiate_child(ttl: 2) --> Bb[node B];
Bb -- reply :ok --> pb;
If we document it in the documentation and in the changelog, that should be enough. I'll emphasise it when doing the usual tweets so that hopefully more people get the message. But maybe it's also something most people will only discover if they also have the issue in question.
I'll write something about the option in the readme.
In normal cases the TTL should not be necessary, but there are some cases which are improved by this. Most notably:
UniformRandomDistribution
algorithm from passing messages too often (image 100 nodes with UniformRandomDistribution, each having a 1% chance of actually handling a start_child, and 99% change of playing hot potato)
A message proxied to another node by the DynamicSupervisor can be proxied again. This is clearly evident when using a large number of nodes and picking
Horde.UniformRandomDistribution
as a distribution strategy. Pushing your luck with that strategy 🍀This commit contains an implementation which limits the amount of times a message can be proxied before expiring, by adding a TTL which works similar to the TTL on IP packets.
The default TTL is
:infinity
, which means the implementation is backwards compatible. Message with TTL :infinity can bounce around between nodes forever. The max TTL can be set to any integer via the new:proxy_message_ttl
option. Each "hop" decreases the TTL for a message by 1 one. When a message with a TTL of zero needs to be proxied, an error will be returned to thereply_to
process.With a distribution strategy of
Horde.UniformDistribution
the issue of passing messages around forever is unlikely, as nodes tend to agree on the outcome ofchoose_node
. However, a recent incident whilst upgrading to OTP 27 brought this issue to light, as the underlying algorithm for choosing nodes was broken on our already upgraded nodes. This causes message to be proxied between our nodes infinitely.Please feel free to pass any form of judgement on the implementation. This is just a jumping-off platform and it can only go uphill from here 👍
With TTL of 2: