derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License
1.3k stars 102 forks source link

race condition in name conflict in same registry process results in all processes losing #250

Closed BrendanBall closed 2 years ago

BrendanBall commented 2 years ago

There seems to be a race condition resulting in all processes losing the name conflict and thus all exiting. I've reproduced it with a failing test here The race condition doesn't seem to happen if adding a 10ms sleep between the 2 spawned processes. Given that this happens in the same registry process (ie. on the same node), I don't expect there to be any race condition and I expect one process to stay alive.

arjan commented 2 years ago

Wow, indeed.. this is not supposed to happen. Even a 1ms sleep inbetween helps already.

derekkraan commented 2 years ago

Great find. Are you able to submit a PR to fix this @BrendanBall?

BrendanBall commented 2 years ago

I can try. I'm currently not sure why this happens. I can see that it does an insert to keys_ets_table and then asynchronously when getting a crdt diff name conflicts are checked. I don't currently understand why both processes end up in a conflict.

Do you think it's good enough to do another name conflict check in handle_call :register, or might that cause other issues?

derekkraan commented 2 years ago

I believe the main race condition is between the call to lookup and GenServer.call in Horde.Registry.register/3. After we do lookup we don't check again. Only the code in the handle_call of Horde.RegistryImpl is run serially, the lookup is run in parallel.

So we register both processes, but the second one is overwriting the first (in ETS). Then the first one comes in via the CRDT and sees that the second is in ETS, it sees a conflict, deregisters the second (and tells it to exit), reregisters the first (in ETS). Then the second comes in via the CRDT, it sees a conflict, deregisters the first (and tells it to exit), reregisters the second. Then the exit signal comes in from the second, resulting in it also being removed from the registry.

So I believe the issue is: 1. there is a race condition in Horde.Registry.register/3, 2. eagerly registering in ETS causes weird things to happen later when the diffs come in from the CRDT.

Now I'm thinking about whether it is right to eagerly register things in ETS. For sure 1. needs to be fixed.

derekkraan commented 2 years ago

Hmm not sure if my logic on 2. is correct here, but there is definitely something fishy going on here. With some debug logging in handle_call({:register, ... and process_diff(state, {:add, ... we should be able to figure out what exactly is happening.

In principle the CRDT should be doing conflict resolution and only sending either pid1 or pid2 to RegistryImpl. I believe this should be the second pid, since the CRDT is "last write wins". If that's the case, then it should see pid2 and do nothing.

derekkraan commented 2 years ago

OK, running the test reveals: the CRDT is sending both pids in diffs. So the scenario I sketched previously is indeed happening.

derekkraan commented 2 years ago

Ah it's getting both pids from the CRDT because the CRDT diff interval is very low, so it's another race condition. But the pids are at least coming in in the correct order. So we'll always get either pid1, then pid2, or only pid2 from the CRDT.

derekkraan commented 2 years ago

This is a bit tricky, I'm almost inclined to say that we should remove eagerly registering in ETS. This is not great, but I am not sure how else this would work?

BrendanBall commented 2 years ago

Doing a first lookup seems to fix the issue However I'm not 100% confident that this will always fix the issue or might cause other issues because of the eventual consistency of the crdt

derekkraan commented 2 years ago

I am also still worried about this eagerly adding entries to the ETS table and possible race conditions stemming from that.

BrendanBall commented 2 years ago

Do you know why entries are currently added eagerly? It could also cause other issues if we remove that?

derekkraan commented 2 years ago

We do it like that to make local changes immediately visible in the registry. So the impact of removing that would be that the registry would be eventually consistent also for local changes.

arjan commented 2 years ago

At the call site you expect a registry entry to be effective immediately, I think? It might lead to strange situations otherwise maybe? Because then an immediate lookup after a register will fail?

derekkraan commented 2 years ago

Horde users are already expected to deal with eventual consistency from one node to the next, so I am wondering whether removing the immediate consistency from local Registry operations would have a large impact.

I think most scenarios where one is starting a process and then immediately looking it up could be fixed by simply referring to the pid. That's how people are expected to deal with eventual consistency when starting remote processes.

I'm leaning towards removing immediate consistency for local registry changes. It would require a new major version number: 0.9. Would like to think on it for a while, and if anyone has any ideas or input then please mention it here.

derekkraan commented 2 years ago

It's been a month. I think it's a good idea to remove this "eager" process registration. Let's do that.

BrendanBall commented 2 years ago

Thanks @derekkraan

derekkraan commented 1 year ago

We do it like that to make local changes immediately visible in the registry. So the impact of removing that would be that the registry would be eventually consistent also for local changes.

On Tue, Apr 26, 2022, at 10:51, Brendan Ball wrote:

Do you know why entries are currently added eagerly? It could also cause other issues if we remove that?

— Reply to this email directly, view it on GitHub https://github.com/derekkraan/horde/issues/250#issuecomment-1109528309, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD7PLKBPDPDK3JK3MOKKWTVG6U7LANCNFSM5UK2M5VQ. You are receiving this because you commented.Message ID: @.***>

sleipnir commented 1 year ago

@derekkraan Please do this optionally. I think that changing this behavior could have important impacts for other use cases than the one mentioned above by you. It would be prudent to do this optionally for the user

derekkraan commented 1 year ago

@sleipnir I believe I did this a long time ago. I don't know what on earth is happening at Github, but I sent this email months ago. image

sleipnir commented 1 year ago

Ok.