Closed: bartj3 closed this PR 2 years ago
This PR is a possible solution for #207. It's WIP, as we'd like to add tests to prove that it solves the problem and doesn't create other problems.
Hi @bartj3, have you made any progress with this PR?
I believe the bug that you are looking at might be present in Horde.Registry, not so much in Horde.DynamicSupervisor. Horde.Registry must ensure that registered processes are globally unique; if it is not doing this then it would be good to know why. Do you have any insights?
No progress yet; we plan to spend a couple of hours on this as a team again next week, as it felt like we learned a lot about Horde supervisors in last week's session. It's good to know we might have been looking in the wrong place.
Without any knowledge of the registry implementation (yet), I'm guessing uniqueness is based on the process identifier: not the pid, but the id Horde gives it. After a node is shut down and a new node picks the process up again, it generates a new child id, so it is unique. It's just not the same as the identifier the old node was using. That's why we felt the supervisor would need to deal with this: uniqueness works, it's just that during a handover a new unique id is generated without cleaning up the old one.
We didn't fully understand why the random child ids were added, so it felt best to leave them in. If they're not needed during handover, #226 might be a much easier fix.
I'll spend some time reading up on the registry implementation and take another stab at understanding the randomized child id change. That'll probably be next week, together with the rest of the team.
Before the randomize_ids change, we were essentially controlling for uniqueness in two places (Horde.Registry and Horde.DynamicSupervisor). This can go wrong when there is a conflict between process A and B, and Horde.Registry chooses process A, while Horde.DynamicSupervisor chooses process B.
To solve this issue, we no longer control for uniqueness in Horde.DynamicSupervisor. Any time we restart a process on a new node, we randomize its id. This ensures that if there is a conflict (due to e.g. a netsplit), Horde.Registry will be in charge of resolving uniqueness.
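To make that concrete, here is a minimal sketch of the randomize-ids behavior described above; the module name, function name, and random range are illustrative, not Horde's actual internals:

```elixir
defmodule ChildIdSketch do
  # Swap the user-supplied child id for a large random integer so that
  # two nodes restarting the same process never collide on child_id in
  # the CRDT. The range size here is arbitrary.
  def randomize_child_id(child_spec) do
    Map.put(child_spec, :id, :rand.uniform(100_000_000))
  end
end

ChildIdSketch.randomize_child_id(%{
  id: :my_worker,
  start: {Task, :start_link, [fn -> :ok end]}
})
# => %{id: 57_340_126, start: {Task, :start_link, ...}} (id is random)
```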
Horde.Registry doesn't work on the child_id; nobody outside of Horde.DynamicSupervisor gets to see the child_id. Horde.Registry works on whatever key you configure in your own application (either with a :via tuple, or by calling Horde.Registry.register in your process).
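For readers following along, here is a minimal sketch of the two registration styles mentioned above (the registry name and keys are made up for illustration):

```elixir
# Assumes a registry started elsewhere, e.g.:
# {:ok, _} = Horde.Registry.start_link(name: MyApp.Registry, keys: :unique)

# 1) Register via a :via tuple when starting the process:
name = {:via, Horde.Registry, {MyApp.Registry, "process 1"}}
{:ok, _pid} = Agent.start_link(fn -> 0 end, name: name)

# 2) Or register explicitly; this call registers the calling process:
Horde.Registry.register(MyApp.Registry, "process 2", nil)
```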
Thanks for the explanation! Like I said, I know very little about the Registry right now, so I'll get back to you once I've read up a bit together with the team. Armed with the test script by @fidr and a lot of debug statements we should be able to tackle this 💪
Just to note, the Registry does work fine here in making sure there are no duplicate processes.
The issue in Horde.DynamicSupervisor is that it keeps references to dead pids (that were handed off during a node restart). These references will never go away. So each time any node goes down, all these ghost references are actually handed off and start real processes again, even though they weren't alive.
The number of ghost references actually grows indefinitely with rolling deploys, unless cleared forcefully by taking all the nodes down.
We dove into this again, but our tests were inconsistent and we couldn't figure out what caused this. So no major progress today.
We looked into Registries for a bit, but still think the problem is with the Supervisor. Our wording earlier was a bit misleading: it's not duplicate processes in the registry or supervisor. It's that references to processes that were handed over when a node went down are not removed from the DynamicSupervisor CRDT.
We're not sure why the DynamicSupervisorTest "failed horde's processes are taken over by other hordes" doesn't appear to be affected by this. We thought it might have to do with the way we started and stopped workers but that does not appear to have an impact.
We used a test which was based on https://github.com/derekkraan/horde/issues/207#issuecomment-821071044
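For context, the shape of that reproduction looks roughly like this; it's a sketch loosely based on that script, with our own names, and the sleep duration is a guess:

```elixir
# Two supervisors form a cluster; stopping one simulates a node going down.
{:ok, _} = Horde.DynamicSupervisor.start_link(name: :sup_1, strategy: :one_for_one)
{:ok, _} = Horde.DynamicSupervisor.start_link(name: :sup_2, strategy: :one_for_one)
Horde.Cluster.set_members(:sup_1, [:sup_1, :sup_2])

spec = %{
  id: :worker_1,
  start: {Task, :start_link, [fn -> Process.sleep(:infinity) end]}
}

{:ok, _pid} = Horde.DynamicSupervisor.start_child(:sup_1, spec)

# Take :sup_1 down; its child should be handed over to :sup_2.
Horde.DynamicSupervisor.stop(:sup_1)
Process.sleep(1_000)

# With the bug, repeating the stop/restart cycle leaves more entries
# here than there are live processes.
IO.inspect(Horde.DynamicSupervisor.which_children(:sup_2))
```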
Thanks @fidr and @bartj3 for the insights. We can't remove these processes from the CRDT when a node goes down, but we can when a node is removed from the CRDT (as a member). There is some code in Horde.RegistryImpl that does this for Horde.Registry; it seems like it should solve this problem as well.
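If I follow, the supervisor-side version of that cleanup would look something like the sketch below; the function name and the :member field are our guesses at how entries are keyed, not Horde's actual internals:

```elixir
defmodule CleanupSketch do
  # Given the supervisor's process CRDT and a member that has been
  # removed, drop every process entry still pointing at that member.
  def remove_member_processes(processes_crdt, member) do
    for {child_id, child} <- DeltaCrdt.read(processes_crdt),
        child.member == member do
      DeltaCrdt.mutate(processes_crdt, :remove, [child_id])
    end

    :ok
  end
end
```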
Just an update on where we're at: today we spent another 2 hours on this, trying to figure out more about the responsibility of the registry.
So if:

- "process 1" is on worker 1 with child id 1
- "process 2" is on worker 2 with child id 2
- we take down worker 1 by killing the app
- worker 2 starts up "process 1" with child id 4
- we start up worker 1 again
- with passive handover, worker 2 keeps "process 1" (child id 4)

At this point, in the CRDT, there's still a reference to child id 1 for "process 1" on worker 1... which doesn't exist.
Without our change, who would ever remove "process 1" on worker 1 with child id 1?
The registry should be doing this, because "process 1" is duplicated from a Horde perspective. However, it isn't actually duplicated: there aren't truly two Elixir processes running, and because of the different child id it isn't a conflict in the CRDT either.
We feel like this shouldn't matter, as lib/horde/registry_impl.ex:200 sends a Process.exit, which doesn't do anything in this scenario: the actual Elixir process doesn't exist. However, the unregister_local should clean up the CRDT here, but it doesn't, or perhaps it doesn't because we don't have the registry running properly in the test we've been using.
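A quick illustration of why that exit signal can't help here: sending an exit to a pid that no longer exists simply does nothing.

```elixir
# A dead pid silently swallows exit signals: Process.exit/2 returns true,
# but no process is affected, so the kill at registry_impl.ex:200 is a
# no-op for a process that has already gone away.
{:ok, pid} = Agent.start(fn -> :ok end)
:ok = Agent.stop(pid)

Process.exit(pid, {:name_conflict, :example})
# => true, even though no process received the signal
```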
In the past (pre 0.6.0) we might have gotten away with an incorrect Registry setup, because without randomized child ids the duplicate was a conflict in the CRDT and the old entry was overwritten.
After this thought we figured we'd add a registry to our test, but that's still a work in progress. We quickly added two registries to our (uncommitted) test with 2 supervisors, each of which would start 2 processes, but somehow we ended up with fewer processes than we hoped for. At this point that might be due to incorrect Registry usage.
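For reference, the setup we were aiming for looks roughly like this; a sketch with our own names, showing the shape of the test rather than the committed code:

```elixir
# Two clustered registries plus two clustered supervisors; each
# supervisor starts two workers registered under a :via name.
{:ok, _} = Horde.Registry.start_link(name: :reg_1, keys: :unique)
{:ok, _} = Horde.Registry.start_link(name: :reg_2, keys: :unique)
Horde.Cluster.set_members(:reg_1, [:reg_1, :reg_2])

{:ok, _} = Horde.DynamicSupervisor.start_link(name: :sup_1, strategy: :one_for_one)
{:ok, _} = Horde.DynamicSupervisor.start_link(name: :sup_2, strategy: :one_for_one)
Horde.Cluster.set_members(:sup_1, [:sup_1, :sup_2])

for {sup, i} <- [{:sup_1, 1}, {:sup_1, 2}, {:sup_2, 3}, {:sup_2, 4}] do
  name = {:via, Horde.Registry, {:reg_1, "process #{i}"}}

  spec = %{
    id: :"worker_#{i}",
    start: {Agent, :start_link, [fn -> i end, [name: name]]}
  }

  {:ok, _} = Horde.DynamicSupervisor.start_child(sup, spec)
end

# We expected to see 4 keys here (select/2 mirrors Elixir's Registry API):
Horde.Registry.select(:reg_1, [{{:"$1", :_, :_}, [], [:"$1"]}])
```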
We've planned another session for this next Wednesday.
Still working on this. Today we spent some time on setting up a test that runs both supervisors and registries.
We're still not clear on how the registry should communicate to the supervisor to clean up its state.
However, even the basic registry setup is acting odd in our test, so we'll do some more research.
After drawing the situation (see attachment) I believe the solution may be simple.
Problem: Dead processes linger in the Supervisor CRDT when a Supervisor dies. When the Supervisor is restarted, the dead processes are still there, causing an issue on the next handover. The registry fails to clean this up, as the termination message which is sent to the Supervisor cannot be processed (the Supervisor is dead after all).
Solution: A Supervisor could clean up the CRDT when it starts up. Any processes lingering in the CRDT at start must be erroneous, as a supervisor takes down all linked processes when it halts or dies. As such, a clean-up at boot would take care of the pollution.
We looked into this possible solution. Our initial idea was to flush the old CRDT in the lib/horde/dynamic_supervisor_impl.ex init function. However, at that point there's no CRDT yet; the supervisor is not part of the cluster yet.
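Concretely, what we were attempting would look something like the sketch below: same mechanics as the member-removal sketch earlier, but keyed on the supervisor's own identity at boot. All names and fields are ours, and the catch is exactly that in init there is no synced CRDT to read yet.

```elixir
defmodule BootFlushSketch do
  # Hypothetical boot-time sweep: any CRDT entry claiming to run under
  # this freshly started supervisor must be stale, because a dying
  # supervisor takes its linked children down with it.
  def flush_stale_entries(processes_crdt, own_name) do
    for {child_id, child} <- DeltaCrdt.read(processes_crdt),
        child.supervisor == own_name do
      DeltaCrdt.mutate(processes_crdt, :remove, [child_id])
    end

    :ok
  end
end
```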
There's another solution: the Registry tells the supervisor to clean up its state, but we couldn't figure out the right place to do so. The registry has no awareness of the duplication going on, so it can't tell the Supervisor exactly what to do; it can only tell it what the truth is, and to delete everything else. But we couldn't figure out how to do this.
We fell back again to our initial solution: the supervisor modifies the state by duplicating processes during handover, so that's the exact moment we know duplication is created, and the moment we can most easily clean it up. We are going to reread the discussion to double-check that we understood the objections against deleting processes from within the supervisor. If we understand the objections properly, we could potentially write tests that cover these concerns and see if we can make the test suite green.
Closing this PR, as the proposed fix might not have been in the right direction, and as I'm no longer using Horde on a daily basis it's unlikely that I'll come up with an alternative fix for this.
By cleaning up dead child processes when spawning the child process on the new supervisor.