derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt

GenServer gets restarted twice after node restarts #224

Open mszmurlo opened 3 years ago

mszmurlo commented 3 years ago

I recently ran into the following issue: a GenServer (called CA in the rest of the post) gets restarted twice by Horde.DynamicSupervisor after nodes have been removed from and re-added to the cluster a few times. The source code demonstrating the issue is available on Gitlab.

I'm using Elixir 1.10.1 and Horde 0.8.3. When I downgraded Horde to 0.7.1, I couldn't reproduce the issue.
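For context, the Horde pieces involved are a Horde.Registry named POC.DReg and a Horde.DynamicSupervisor named POC.DSup, started in the application's supervision tree, plus a NodeListener that keeps their membership in sync (see the logs below). A minimal sketch of that setup, assuming the usual options (the real code, including the libcluster setup, is in the Gitlab repo):

    # Sketch of POC.Application.start/2 -- option values and the POCWeb.Endpoint
    # child are assumptions; the real code (including the libcluster setup for
    # the :example topology seen in the logs) is in the Gitlab repo.
    def start(_type, _args) do
      children = [
        {Horde.Registry, [name: POC.DReg, keys: :unique]},
        {Horde.DynamicSupervisor, [name: POC.DSup, strategy: :one_for_one]},
        POC.NodeListener,   # re-syncs Horde members on :nodeup / :nodedown
        POCWeb.Endpoint
      ]

      Supervisor.start_link(children, strategy: :one_for_one, name: POC.Supervisor)
    end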

Demonstration

To demonstrate the issue:

  1. Start the application
  2. Create a GenServer (module POC.CA in my code) through the HTTP interface, i.e. by executing the POCWeb.Login.exec() controller
  3. Kill the node on which the CA is running (by pressing Ctrl-C twice), restart it, redo the operation, and observe the logs

The first two steps are as follows:
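The exact commands are in the Gitlab repo; roughly, they come down to the following (the HTTP port of poc1 and the exact curl invocation are my assumptions; the poc2 start command appears verbatim further down):

    # Terminal 1 and Terminal 2: start the two nodes
    $ HTTP_PORT=5001 ERL_AFLAGS="-name poc1@127.0.0.1 -setcookie abc" iex -S mix phx.server
    $ HTTP_PORT=5002 ERL_AFLAGS="-name poc2@127.0.0.1 -setcookie abc" iex -S mix phx.server

    # Then create the CA through the HTTP interface (illustrative request,
    # matching the "POST /api" with cmd=login seen in the logs)
    $ curl -X POST -H "Content-Type: application/json" \
        -d '{"cmd": "login"}' http://127.0.0.1:5001/api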

In the process above, let's assume that while the HTTP request hits node N1, the CA is created on node N2. This gives us the following traces:

Terminal 1
    ... 
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex(poc1@127.0.0.1)1> [debug] NodeListener.handle_info(:nodeup)
    [info] [libcluster:example] connected to :"poc2@127.0.0.1"
    [debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}, {POC.DReg, :"poc2@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}, {POC.DSup, :"poc2@127.0.0.1"}]
(3) [info] POST /api
    [debug] Processing with POCWeb.Login.exec/2
      Parameters: %{"cmd" => "login"}
      Pipelines: [:api]
(4) [debug] Login.exec()
(5) [debug] Login.exec(): start_child->res={:ok, #PID<21019.478.0>}
    [info] Sent 200 in 99ms
  1. The BEAM is booting
  2. Detection of the second node joining the cluster
  3. Reception of the HTTP request
  4. Start of the login controller. The controller calls Horde.DynamicSupervisor.start_child(POC.DSup, {POC.CA, {uaid, caid}}), which in turn calls CA.start_link() (see the sketch after this list)
  5. End of the login controller
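The start_child/2 call from step 4 presumably reduces to something like this; the uaid/caid generation and the response handling are hypothetical, only the start_child/2 call itself is taken from the logs:

    # Sketch of POCWeb.Login.exec/2 -- random_id/0 and the JSON response are
    # hypothetical placeholders; the start_child/2 call matches the logs.
    def exec(conn, %{"cmd" => "login"}) do
      Logger.debug("Login.exec()")
      uaid = random_id()
      caid = random_id()
      res = Horde.DynamicSupervisor.start_child(POC.DSup, {POC.CA, {uaid, caid}})
      Logger.debug("Login.exec(): start_child->res=#{inspect(res)}")
      json(conn, %{result: "ok"})
    end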
Terminal 2
    ...
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex(poc2@127.0.0.1)1> [debug] NodeListener.handle_info(:nodeup)
    [debug] Cluster members: [{POC.DReg, :"poc2@127.0.0.1"}, {POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc2@127.0.0.1"}, {POC.DSup, :"poc1@127.0.0.1"}]
(3) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(4) [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(5) [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.478.0>}
    [debug] CA.start_link(): process started
  1. The BEAM is booting
  2. Detection of the second node joining the cluster
  3. CA.start_link() called by the login controller. It calls GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid)) (see the sketch after this list)
  4. CA.init() called by GenServer.start_link() at step 3.
  5. Back in CA.start_link(). The process was created on N2
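Putting steps 3 to 5 together, the CA module presumably looks roughly like this (a sketch reconstructed from the log messages; the state kept in init/1 is an assumption, the actual code is on Gitlab):

    # Sketch of POC.CA reconstructed from the log messages; the state kept in
    # init/1 is an assumption.
    defmodule POC.CA do
      use GenServer
      require Logger

      def start_link({uaid, caid}) do
        Logger.debug("CA.start_link(uaid=#{inspect(uaid)}, caid=#{inspect(caid)})")

        case GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid)) do
          {:ok, _pid} = res ->
            Logger.debug("CA.start_link(): GS.start_link->res=#{inspect(res)}")
            Logger.debug("CA.start_link(): process started")
            res

          error ->
            Logger.debug("CA.start_link(): error: #{inspect(error)}. process not started")
            error
        end
      end

      # Register the process under its caid in the distributed registry
      defp via_tuple(caid), do: {:via, Horde.Registry, {POC.DReg, caid}}

      @impl true
      def init({uaid, caid}) do
        Logger.debug("CA.init(uaid=#{inspect(uaid)}, caid=#{inspect(caid)})")
        {:ok, %{uaid: uaid, caid: caid}}
      end
    end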

Killing and restarting the nodes

From now on, we will

  1. Kill the node on which the CA is running
  2. Restart that node
  3. Observe the logs
  4. Go to step 1.

Kill N2 (as the CA is running on N2)

Terminal 1
(1) [debug] NodeListener.handle_info(:nodedown)
    [debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}]
(2) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.494.0>}
    [debug] CA.start_link(): process started
  1. Detection of the N2 failure; cluster reorganization (see the NodeListener sketch below)
  2. Restart of the CA on node N1
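The "Cluster members" lines come from the NodeListener, which presumably recomputes Horde's membership on every :nodeup / :nodedown. A sketch, assuming the usual Horde.Cluster.set_members/2 approach (the actual module is in the Gitlab repo):

    # Sketch of POC.NodeListener, reconstructed from the log messages; assumes
    # the usual Horde.Cluster.set_members/2 membership handling.
    defmodule POC.NodeListener do
      use GenServer
      require Logger

      def start_link(_), do: GenServer.start_link(__MODULE__, [], name: __MODULE__)

      @impl true
      def init(_) do
        :net_kernel.monitor_nodes(true)   # subscribe to :nodeup / :nodedown messages
        {:ok, nil}
      end

      @impl true
      def handle_info({:nodeup, _node}, state) do
        Logger.debug("NodeListener.handle_info(:nodeup)")
        set_members(POC.DReg)
        set_members(POC.DSup)
        {:noreply, state}
      end

      def handle_info({:nodedown, _node}, state) do
        Logger.debug("NodeListener.handle_info(:nodedown)")
        set_members(POC.DReg)
        set_members(POC.DSup)
        {:noreply, state}
      end

      # Tell Horde which {name, node} pairs currently form the cluster
      defp set_members(name) do
        members = Enum.map([Node.self() | Node.list()], &{name, &1})
        Logger.debug("Cluster members: #{inspect(members)}")
        Horde.Cluster.set_members(name, members)
      end
    end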

Restart of node 2

Terminal 2
(1) $ HTTP_PORT=5002 ERL_AFLAGS="-name poc2@127.0.0.1 -setcookie abc" iex -S mix phx.server
(2) Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1]
    ...
    Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(3)    [debug] NodeListener.handle_info(:nodeup)
    iex(poc2@127.0.0.1)1> [debug] Cluster members: [{POC.DReg, :"poc2@127.0.0.1"}, {POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc2@127.0.0.1"}, {POC.DSup, :"poc1@127.0.0.1"}]
  1. Starting node N2
  2. BEAM boots and the application starts
  3. Cluster reorganization. The same happens on node N1

Kill N1 (as the CA had been restarted on N1)

Terminal 2
(1) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.484.0>}
    [debug] CA.start_link(): process started
  1. Process for the CA is restarted on node N2

Restart of node 1

The cluster gets reorganized as before

Kill N2

This is where the strange thing happens. Look at step 3.

Terminal 1
(1) [debug] NodeListener.handle_info(:nodedown)
    [debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}]
(2) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.482.0>}
    [debug] CA.start_link(): process started
(3) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:error, {:already_started, #PID<0.482.0>}}
    [debug] CA.start_link(): error: {:error, {:already_started, #PID<0.482.0>}}. process not started
  1. Detection of the failure of node N2. Cluster reorganization
  2. The CA gets restarted. This is normal. The following three lines show that the process restarts successfully
  3. CA.start_link() gets called a second time. As the process had already been started in step 2, the :already_started error is returned
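For what it's worth, a common way to keep this duplicate start attempt from turning into supervisor noise is to map {:error, {:already_started, pid}} to :ignore in the start_link wrapper, so the second start attempt becomes a no-op. A sketch of that variant (it only suppresses the symptom and does not explain why the CA is started twice in the first place):

    # Variant of CA.start_link/1 that turns the duplicate-start error into
    # :ignore -- a workaround sketch, not a fix for the double restart itself.
    def start_link({uaid, caid}) do
      case GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid)) do
        {:ok, pid} ->
          {:ok, pid}

        {:error, {:already_started, pid}} ->
          Logger.debug("CA already running as #{inspect(pid)}, ignoring duplicate start")
          :ignore
      end
    end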