I recently ran into the following issue. A GenServer (called CA in the rest of the post) gets restarted twice by the
Horde.DynamicSupervisor after nodes have been removed from and re-added to the cluster a few times.
The source code demonstrating the issue is available on GitLab.
I'm using Elixir 1.10.1 and Horde 0.8.3. When I downgraded Horde to 0.7.1, I couldn't reproduce the issue.
Demonstration
To demonstrate the issue:
1. Start the application.
2. Create a GenServer (module POC.CA in my code) through the HTTP interface, i.e. execute the POCWeb.Login.exec() controller.
3. Kill (by pressing Control-c twice) the node on which the CA is running, restart it, repeat the operation and observe the logs.
The first two steps are as follows:
1. Start node 1 in one terminal: $ HTTP_PORT=5001 ERL_AFLAGS="-name poc1@127.0.0.1 -setcookie abc" iex -S mix phx.server
2. Start node 2 in another terminal: $ HTTP_PORT=5002 ERL_AFLAGS="-name poc2@127.0.0.1 -setcookie abc" iex -S mix phx.server
3. Send the following request from a third terminal: $ http -v http://localhost:5001/api cmd=login
In the process above, let's assume that the HTTP request hits node N1 while the CA is created on N2. This gives us the following traces:
Terminal 1
...
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex(poc1@127.0.0.1)1> [debug] NodeListener.handle_info(:nodeup)
[info] [libcluster:example] connected to :"poc2@127.0.0.1"
[debug] Cluster members: [{POC.DReg, :"poc1@127.0.0.1"}, {POC.DReg, :"poc2@127.0.0.1"}]
[debug] Cluster members: [{POC.DSup, :"poc1@127.0.0.1"}, {POC.DSup, :"poc2@127.0.0.1"}]
(3) [info] POST /api
[debug] Processing with POCWeb.Login.exec/2
Parameters: %{"cmd" => "login"}
Pipelines: [:api]
(4) [debug] Login.exec()
(5) [debug] Login.exec(): start_child->res={:ok, #PID<21019.478.0>}
[info] Sent 200 in 99ms
(1) The BEAM is booting.
(2) Detection of the second node joining the cluster.
(3) Reception of the HTTP request.
(4) Start of the login controller. The controller calls Horde.DynamicSupervisor.start_child(POC.DSup, {POC.CA, {uaid, caid}}), which in turn calls CA.start_link().
(5) End of the login controller.
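The way a {module, arg} child spec ends up calling start_link/1 can be sketched with the standard library alone. This is only an illustration: it uses Elixir's built-in DynamicSupervisor as a stand-in for Horde.DynamicSupervisor (assuming the same call shape), and the names Sketch.CA and Sketch.DSup are hypothetical, not the modules from my project.

```elixir
defmodule Sketch.CA do
  use GenServer

  # Called by the supervisor: `use GenServer` generates child_spec/1 whose
  # start entry is {Sketch.CA, :start_link, [arg]}.
  def start_link({uaid, caid}) do
    GenServer.start_link(__MODULE__, {uaid, caid})
  end

  @impl true
  def init({uaid, caid}), do: {:ok, %{uaid: uaid, caid: caid}}
end

# Local stand-in for the distributed supervisor (assumption, not Horde itself).
{:ok, _sup} = DynamicSupervisor.start_link(strategy: :one_for_one, name: Sketch.DSup)

# Same argument shape as the call made by the login controller.
res = DynamicSupervisor.start_child(Sketch.DSup, {Sketch.CA, {"FFLFNTRJWV", "OLGFXEZLCO"}})
IO.inspect(res, label: "start_child->res")
```

With the local supervisor the returned pid always lives on the calling node. With Horde the child may be started on another node, which is why the trace above shows a remote pid.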
Terminal 2
...
(1) Interactive Elixir (1.10.1) - press Ctrl+C to exit (type h() ENTER for help)
(2) iex(poc2@127.0.0.1)1> [debug] NodeListener.handle_info(:nodeup)
[debug] Cluster members: [{POC.DReg, :"poc2@127.0.0.1"}, {POC.DReg, :"poc1@127.0.0.1"}]
[debug] Cluster members: [{POC.DSup, :"poc2@127.0.0.1"}, {POC.DSup, :"poc1@127.0.0.1"}]
(3) [debug] CA.start_link({uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(4) [debug] CA.init(uaid="FFLFNTRJWV", caid="OLGFXEZLCO")
(5) [debug] CA.start_link(): GS.start_link->res={:ok, #PID<0.478.0>}
[debug] CA.start_link(): process started
(1) The BEAM is booting.
(2) Detection of the second node joining the cluster.
(3) CA.start_link() is called as a result of the login controller's request. It calls GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid)).
(4) CA.init() is called by GenServer.start_link() from step 3.
(5) Back in CA.start_link(). The process was created on N2.
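For readers unfamiliar with via tuples, here is a minimal sketch of the registration pattern used in step 3. It assumes the standard library Registry as a local stand-in for the distributed registry (POC.DReg in the traces), and the body of via_tuple/1 is my guess at the general shape, not a copy of the project code. ViaSketch.CA, ViaSketch.Reg and whoami/1 are hypothetical names.

```elixir
defmodule ViaSketch.CA do
  use GenServer

  # Assumption: the real via_tuple/1 points at a Horde.Registry instead.
  def via_tuple(caid), do: {:via, Registry, {ViaSketch.Reg, caid}}

  def start_link({uaid, caid}) do
    # The process registers itself under caid while it starts.
    GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid))
  end

  # Hypothetical helper: address the process by caid instead of by pid.
  def whoami(caid), do: GenServer.call(via_tuple(caid), :whoami)

  @impl true
  def init({uaid, caid}), do: {:ok, %{uaid: uaid, caid: caid}}

  @impl true
  def handle_call(:whoami, _from, state) do
    {:reply, {state.uaid, state.caid}, state}
  end
end

{:ok, _reg} = Registry.start_link(keys: :unique, name: ViaSketch.Reg)
{:ok, pid} = ViaSketch.CA.start_link({"FFLFNTRJWV", "OLGFXEZLCO"})

# The registry now maps the caid to the pid of the new process.
lookup = Registry.lookup(ViaSketch.Reg, "OLGFXEZLCO")
reply = ViaSketch.CA.whoami("OLGFXEZLCO")
```

Because the name is unique per registry, only one process can hold a given caid at a time, which matters for the rest of this post.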
Killing and restarting the nodes
From now on, we will repeatedly kill the node on which the CA is running and then restart that node.
Kill N2 (as the CA is running on N2)
Terminal 1
The CA is restarted on N1.
Restart of node 2
Terminal 2
Kill N1 (as the CA had been restarted on N1)
Terminal 2
The CA is restarted on node N2.
Restart of node 1
The cluster gets reorganized as before.
Kill N2
This is where the strange thing happens. Look at step 3.
Terminal 1
(2) The CA gets restarted. This is normal. The following 3 lines show that the process restarts successfully.
(3) CA.start_link() gets called a second time. As the process had already been started in step 2, the error :already_started is returned.
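The shape of that error can be reproduced locally. The sketch below does not reproduce the Horde behaviour itself; it only shows what GenServer.start_link/3 returns when the name is already taken, assuming a standard library Registry as a stand-in for the distributed registry. DupSketch.CA and DupSketch.Reg are hypothetical names.

```elixir
defmodule DupSketch.CA do
  use GenServer

  # Assumption: a local Registry replaces the distributed one.
  def via_tuple(caid), do: {:via, Registry, {DupSketch.Reg, caid}}

  def start_link({uaid, caid}) do
    GenServer.start_link(__MODULE__, {uaid, caid}, name: via_tuple(caid))
  end

  @impl true
  def init(args), do: {:ok, args}
end

{:ok, _reg} = Registry.start_link(keys: :unique, name: DupSketch.Reg)

# First start succeeds and registers the caid.
first = DupSketch.CA.start_link({"FFLFNTRJWV", "OLGFXEZLCO"})

# Second start with the same caid fails: the name is already registered.
second = DupSketch.CA.start_link({"FFLFNTRJWV", "OLGFXEZLCO"})

IO.inspect(first, label: "first")
IO.inspect(second, label: "second")
```

The second call returns an {:error, {:already_started, pid}} tuple carrying the pid of the process that already owns the name. This matches the error seen in step 3, where the supervisor attempts a start that has already happened.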