derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt

GenServer gets restarted twice after node restarts #224

Open mszmurlo opened 3 years ago

mszmurlo commented 3 years ago

I recently ran into the following issue: a GenServer (called CA in the rest of the post) gets restarted twice by Horde.DynamicSupervisor after nodes have been removed from and re-added to the cluster a few times. The source code demonstrating the issue is available on Gitlab.

I'm using Elixir 1.10.1 and Horde 0.8.3. When I downgraded Horde to 0.7.1, I couldn't reproduce the issue.
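For context, the Horde pieces involved are a Horde.Registry named POC.DReg and a Horde.DynamicSupervisor named POC.DSup, started in the application's supervision tree, plus a NodeListener that keeps their membership in sync (see the logs below). A minimal sketch of that setup, assuming the usual options (the real code, including the libcluster setup, is in the Gitlab repo):

    # Sketch of POC.Application.start/2 -- option values and the POCWeb.Endpoint
    # child are assumptions; the real code (including the libcluster setup for
    # the :example topology seen in the logs) is in the Gitlab repo.
    def start(_type, _args) do
      children = [
        {Horde.Registry, [name: POC.DReg, keys: :unique]},
        {Horde.DynamicSupervisor, [name: POC.DSup, strategy: :one_for_one]},
        POC.NodeListener,   # re-syncs Horde members on :nodeup / :nodedown
        POCWeb.Endpoint
      ]

      Supervisor.start_link(children, strategy: :one_for_one, name: POC.Supervisor)
    end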

Demonstration

To demonstrate the issue:

  1. Start the application
  2. Create a GenServer (module POC.CA in my code) through the HTTP interface, i.e. by executing the POCWeb.Login.exec() controller
  3. Kill the node on which the CA is running (by pressing Ctrl-C twice), restart it, redo the operation, and observe the logs

The first two steps are as follows:
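The exact commands are in the Gitlab repo; roughly, they come down to the following (the HTTP port of poc1 and the exact curl invocation are my assumptions; the poc2 start command appears verbatim further down):

    # Terminal 1 and Terminal 2: start the two nodes
    $ HTTP_PORT=5001 ERL_AFLAGS="-name poc1@127.0.0.1 -setcookie abc" iex -S mix phx.server
    $ HTTP_PORT=5002 ERL_AFLAGS="-name poc2@127.0.0.1 -setcookie abc" iex -S mix phx.server

    # Then create the CA through the HTTP interface (illustrative request,
    # matching the "POST /api" with cmd=login seen in the logs)
    $ curl -X POST -H "Content-Type: application/json" \
        -d '{"cmd": "login"}' http://127.0.0.1:5001/api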

In the process above, let's assume that while the HTTP request hits node N1, the CA is created on node N2. This gives us the following traces:

Terminal 1
    ... 
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex(poc1@127.0.0.1)1> [debug] NodeListener.handle_info(:nodeup)
    [info] [libcluster:example] connected to :"poc2@127.0.0.1"
    [debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}, {POC.DReg, :"poc2@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}, {POC.DSup, :"poc2@127.0.0.1"}]
(3) [info] POST /api
    [debug] Processing with POCWeb.Login.exec/2
      Parameters: %{"cmd" => "login"}
      Pipelines: [:api]
(4) [debug] Login.exec()
(5) [debug] Login.exec(): start_child->res={:ok, #PID<21019.478.0>}
    [info] Sent 200 in 99ms
  1. The BEAM is booting
  2. Detection of the second node joining the cluster
  3. Reception of the HTTP request
  4. Start of the login controller. The controller calls Horde.DynamicSupervisor.start_child(POC.DSup, {POC.CA, {uaid, caid}}), which in turn calls CA.start_link() (see the sketch after this list)
  5. End of the login controller
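The start_child/2 call from step 4 presumably reduces to something like this; the uaid/caid generation and the response handling are hypothetical, only the start_child/2 call itself is taken from the logs:

    # Sketch of POCWeb.Login.exec/2 -- random_id/0 and the JSON response are
    # hypothetical placeholders; the start_child/2 call matches the logs.
    def exec(conn, %{"cmd" => "login"}) do
      Logger.debug("Login.exec()")
      uaid = random_id()
      caid = random_id()
      res = Horde.DynamicSupervisor.start_child(POC.DSup, {POC.CA, {uaid, caid}})
      Logger.debug("Login.exec(): start_child->res=#{inspect(res)}")
      json(conn, %{result: "ok"})
    end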
Terminal 2
    ...
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex(poc2@127.0.0.1)1> [debug] NodeListener.handle_info(:nodeup)
    [debug] Cluster members: [{POC.DReg, :"poc2@127.0.0.1"}, {POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc2@127.0.0.1"}, {POC.DSup, :"poc1@127.0.0.1"}]
(3) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(4) [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(5) [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.478.0>}
    [debug] CA.start_link(): process started
  1. The BEAM is booting
  2. Detection of the second node joining the cluster
  3. CA.start_link() called by the login controller. It calls GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid)) (see the sketch after this list)
  4. CA.init() called by GenServer.start_link() at step 3.
  5. Back in CA.start_link(). The process was created on N2
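Putting steps 3 to 5 together, the CA module presumably looks roughly like this (a sketch reconstructed from the log messages; the state kept in init/1 is an assumption, the actual code is on Gitlab):

    # Sketch of POC.CA reconstructed from the log messages; the state kept in
    # init/1 is an assumption.
    defmodule POC.CA do
      use GenServer
      require Logger

      def start_link({uaid, caid}) do
        Logger.debug("CA.start_link(uaid=#{inspect(uaid)}, caid=#{inspect(caid)})")

        case GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid)) do
          {:ok, _pid} = res ->
            Logger.debug("CA.start_link(): GS.start_link->res=#{inspect(res)}")
            Logger.debug("CA.start_link(): process started")
            res

          error ->
            Logger.debug("CA.start_link(): error: #{inspect(error)}. process not started")
            error
        end
      end

      # Register the process under its caid in the distributed registry
      defp via_tuple(caid), do: {:via, Horde.Registry, {POC.DReg, caid}}

      @impl true
      def init({uaid, caid}) do
        Logger.debug("CA.init(uaid=#{inspect(uaid)}, caid=#{inspect(caid)})")
        {:ok, %{uaid: uaid, caid: caid}}
      end
    end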

Killing and restarting the nodes

From now on, we will

  1. Kill the node on which the CA is running
  2. Restart that node
  3. Observe the logs
  4. Go to step 1.

Kill N2 (as the CA is running on N2)

Terminal 1
(1) [debug] NodeListener.handle_info(:nodedown)
    [debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}]
(2) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.494.0>}
    [debug] CA.start_link(): process started
  1. Detection of the N2 failure; cluster reorganization (see the NodeListener sketch below)
  2. Restart of the CA on node N1
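The "Cluster members" lines come from the NodeListener, which presumably recomputes Horde's membership on every :nodeup / :nodedown. A sketch, assuming the usual Horde.Cluster.set_members/2 approach (the actual module is in the Gitlab repo):

    # Sketch of POC.NodeListener, reconstructed from the log messages; assumes
    # the usual Horde.Cluster.set_members/2 membership handling.
    defmodule POC.NodeListener do
      use GenServer
      require Logger

      def start_link(_), do: GenServer.start_link(__MODULE__, [], name: __MODULE__)

      @impl true
      def init(_) do
        :net_kernel.monitor_nodes(true)   # subscribe to :nodeup / :nodedown messages
        {:ok, nil}
      end

      @impl true
      def handle_info({:nodeup, _node}, state) do
        Logger.debug("NodeListener.handle_info(:nodeup)")
        set_members(POC.DReg)
        set_members(POC.DSup)
        {:noreply, state}
      end

      def handle_info({:nodedown, _node}, state) do
        Logger.debug("NodeListener.handle_info(:nodedown)")
        set_members(POC.DReg)
        set_members(POC.DSup)
        {:noreply, state}
      end

      # Tell Horde which {name, node} pairs currently form the cluster
      defp set_members(name) do
        members = Enum.map([Node.self() | Node.list()], &{name, &1})
        Logger.debug("Cluster members: #{inspect(members)}")
        Horde.Cluster.set_members(name, members)
      end
    end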

Restart of node 2

Terminal 2
(1) $ HTTP_PORT=5002 ERL_AFLAGS="-name poc2@127.0.0.1 -setcookie abc" iex -S mix phx.server
(2) Erlang/OTP 22 [erts-10.6.4] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1]
    ...
    Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(3)    [debug] NodeListener.handle_info(:nodeup)
    iex(poc2@127.0.0.1)1> [debug] Cluster members: [{POC.DReg, :"poc2@127.0.0.1"}, {POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc2@127.0.0.1"}, {POC.DSup, :"poc1@127.0.0.1"}]
  1. Starting node N2
  2. BEAM boots and the application starts
  3. Cluster reorganization. The same happens on node N1

Kill N1 (as the CA had been restarted on N1)

Terminal 2
(1) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.484.0>}
    [debug] CA.start_link(): process started
  1. Process for the CA is restarted on node N2

Restart of node 1

The cluster gets reorganized as before

Kill N2

This is where the strange thing happens. Look at step 3.

Terminal 1
(1) [debug] NodeListener.handle_info(:nodedown)
    [debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}]
    [debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}]
(2) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.482.0>}
    [debug] CA.start_link(): process started
(3) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
    [debug] CA.start_link(): GS.start_link->res={:error, {:already_started, #PID<0.482.0>}}
    [debug] CA.start_link(): error: {:error, {:already_started, #PID<0.482.0>}}. process not started
  1. Detection of the failure of node N2. Cluster reorganization
  2. The CA gets restarted. This is normal. The following three lines show that the process restarts successfully
  3. CA.start_link() gets called a second time. As the process had already been started in step 2, the :already_started error is returned
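For what it's worth, a common way to keep this duplicate start attempt from turning into supervisor noise is to map {:error, {:already_started, pid}} to :ignore in the start_link wrapper, so the second start attempt becomes a no-op. A sketch of that variant (it only suppresses the symptom and does not explain why the CA is started twice in the first place):

    # Variant of CA.start_link/1 that turns the duplicate-start error into
    # :ignore -- a workaround sketch, not a fix for the double restart itself.
    def start_link({uaid, caid}) do
      case GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid)) do
        {:ok, pid} ->
          {:ok, pid}

        {:error, {:already_started, pid}} ->
          Logger.debug("CA already running as #{inspect(pid)}, ignoring duplicate start")
          :ignore
      end
    end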