Closed BrendanBall closed 2 years ago
Wow, indeed.. this is not supposed to happen. Even a 1ms sleep inbetween helps already.
Great find. Are you able to submit a PR to fix this @BrendanBall?
I can try. I'm currently not sure why this happens.
I can see that it does an insert to keys_ets_table
and then asynchronously when getting a crdt diff name conflicts are checked.
I don't currently understand why both processes end up in a conflict.
Do you think it's good enough to do another name conflict check in handle_call :register
, or might that cause other issues?
I believe the main race condition is between the call to lookup
and GenServer.call
in Horde.Registry.register/3
. After we do lookup
we don't check again. Only the code in the handle_call
of Horde.RegistryImpl
is run serially, the lookup
is run in parallel.
So we register both processes, but the second one is overwriting the first (in ETS). Then the first one comes in via the CRDT and sees that the second is in ETS, it sees a conflict, deregisters the second (and tells it to exit), reregisters the first (in ETS). Then the second comes in via the CRDT, it sees a conflict, deregisters the first (and tells it to exit), reregisters the second. Then the exit signal comes in from the second, resulting in it also being removed from the registry.
So I believe the issue is: 1. there is a race condition in Horde.Registry.register/3
, 2. eagerly registering in ETS causes weird things to happen later when the diffs come in from the CRDT.
Now I'm thinking about whether it is right to eagerly register things in ETS. For sure 1. needs to be fixed.
Hmm not sure if my logic on 2. is correct here, but there is definitely something fishy going on here. With some debug logging in handle_call({:register, ...
and process_diff(state, {:add, ...
we should be able to figure out what exactly is happening.
In principle the CRDT should be doing conflict resolution and only sending either pid1 or pid2 to RegistryImpl
. I believe this should be the second pid, since the CRDT is "last write wins". If that's the case, then it should see pid2 and do nothing.
OK, running the test reveals: the CRDT is sending both pids in diffs. So the scenario I sketched previously is indeed happening.
Ah it's getting both pids from the CRDT because the CRDT diff interval is very low, so it's another race condition. But the pids are at least coming in in the correct order. So we'll always get either pid1, then pid2, or only pid2 from the CRDT.
This is a bit tricky, I'm almost inclined to say that we should remove eagerly registering in ETS. This is not great, but I am not sure how else this would work?
Doing a first lookup seems to fix the issue However I'm not 100% confident that this will always fix the issue or might cause other issues because of the eventual consistency of the crdt
I am also still worried about this eagerly adding entries to the ETS table and possible race conditions stemming from that.
Do you know why entries are currently added eagerly? It could also cause other issues if we remove that?
We do it like that to make local changes immediately visible in the registry. So the impact of removing that would be that the registry would be eventually consistent also for local changes.
At the call site you expect a registry entry to be effective immediately, I think? It might lead to strange situations otherwise maybe? Because then an immediate lookup after a register will fail?
Horde users are already expected to deal with eventual consistency from one node to the next, so I am wondering whether removing the immediate consistency from local Registry operations would have a large impact.
I think most scenarios where one is starting a process and then immediately looking it up could be fixed by simply referring to the pid. That's how people are expected to deal with eventual consistency when starting remote processes.
I'm leaning towards removing immediate consistency for local registry changes. It would require a new major version number: 0.9. Would like to think on it for a while, and if anyone has any ideas or input then please mention it here.
It's been a month. I think it's a good idea to remove this "eager" process registration. Let's do that.
Thanks @derekkraan
We do it like that to make local changes immediately visible in the registry. So the impact of removing that would be that the registry would be eventually consistent also for local changes.
On Tue, Apr 26, 2022, at 10:51, Brendan Ball wrote:
Do you know why entries are currently added eagerly? It could also cause other issues if we remove that?
— Reply to this email directly, view it on GitHub https://github.com/derekkraan/horde/issues/250#issuecomment-1109528309, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7PLKBPDPDK3JK3MOKKWTVG6U7LANCNFSM5UK2M5VQ. You are receiving this because you commented.Message ID: @.***>
@derekkraan Please do this optionally. I think that changing this behavior could have important impacts for other use cases than the one mentioned above by you. It would be prudent to do this optionally for the user
@sleipnir I believe I did this a long time ago. I don't know what on earth is happening at Github, but I sent this email months ago.
Ok.
There seems to be a race condition resulting in all processes losing the name conflict and thus all exiting. I've reproduced it with a failing test here The race condition doesn't seem to happen if adding a 10ms sleep between the 2 spawned processes. Given that this happens in the same registry process (ie. on the same node), I don't expect there to be any race condition and I expect one process to stay alive.