Topology change on node startup when process modules not yet loaded

msw10100 commented 6 years ago

Using Swarm 3.0.5, when I have processes running on one node and a new node joins, causing some processes to be moved to the new node, I often see a warning on the originating node:

2017-11-08 15:01:01.231 [warn] [swarm on a@127.0.0.1] [tracker:start_pid_remotely] "ID_4" could not be started on b@127.0.0.1: {:error, :undef}

and on the target node:

15:01:01.231 [warn]  [swarm on b@127.0.0.1] [tracker:handle_call] ** (UndefinedFunctionError) function Gptest.Service.start_link/1 is undefined (module Gptest.Service is not available)
    Gptest.Service.start_link({:id, "4"})
    (swarm) lib/swarm/tracker/tracker.ex:961: Swarm.Tracker.handle_call/3
    (stdlib) gen_statem.erl:1240: :gen_statem.call_state_function/5
    (stdlib) gen_statem.erl:1012: :gen_statem.loop_event/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

It appears that because Swarm loads before our application modules (Gptest, above), the VM has not yet loaded Gptest.Service when Swarm attempts to start the process on the new node. Since there is no retry when that error occurs, the process stays down until something external restarts it.

Is there a mechanism that I could use to either delay the attempt to start the process or retry when I get this error?
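For illustration, the "delay" option could be approximated by checking that the target module and function are available before trying to start the process. This is only a sketch: `StartGuard` and `startable?/3` are hypothetical names, not part of Swarm, and the check would have to run on the node where the process is to be started. It relies on the standard `Code.ensure_loaded?/1` and `function_exported?/3`.

```elixir
defmodule StartGuard do
  # Hypothetical helper, not part of Swarm.
  # Returns true only if `module` is loaded (or can be loaded from the
  # code path) and exports `fun` with the given arity.
  @spec startable?(module(), atom(), non_neg_integer()) :: boolean()
  def startable?(module, fun, arity) do
    Code.ensure_loaded?(module) and function_exported?(module, fun, arity)
  end
end
```

A caller could poll `StartGuard.startable?(Gptest.Service, :start_link, 1)` and postpone the start until it returns true.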

msw10100 commented 6 years ago

I made the following patch to tracker.ex, which resolved the issue for me:

diff --git a/lib/swarm/tracker/tracker.ex b/lib/swarm/tracker/tracker.ex
index 40d846b..dd318db 100644
--- a/lib/swarm/tracker/tracker.ex
+++ b/lib/swarm/tracker/tracker.ex
@@ -1137,6 +1137,10 @@ defmodule Swarm.Tracker do
           warn "#{inspect name} could not be started on #{remote_node}: #{inspect err}, retrying operation after #{@retry_interval}ms.."
           :timer.sleep @retry_interval
           start_pid_remotely(remote_node, from, name, meta, state, attempts + 1)
+        {:error, :undef} = err ->
+          warn "#{inspect name} could not be started on #{remote_node}: #{inspect err}, retrying operation after #{@retry_interval}ms.."
+          :timer.sleep @retry_interval
+          start_pid_remotely(remote_node, from, name, meta, state, attempts + 1)
         {:error, _reason} = err ->
           warn "#{inspect name} could not be started on #{remote_node}: #{inspect err}"
           reply(from, err)

Is there a better approach that I should consider? Or should I make a PR for this?
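The retry idea in the patch can be shown in isolation as a small standalone sketch. Everything here is hypothetical (`RetrySketch`, `with_retry/2`, the interval and the attempt limit); it does not reproduce Swarm's tracker code, and how Swarm itself bounds `attempts` is not shown in the diff above.

```elixir
defmodule RetrySketch do
  # Assumed values for illustration only; Swarm's actual settings may differ.
  @retry_interval 10
  @max_attempts 3

  # Function head carrying the default for `attempts`.
  def with_retry(fun, attempts \\ 0)

  # Give up once the attempt limit is reached.
  def with_retry(_fun, attempts) when attempts >= @max_attempts do
    {:error, :too_many_attempts}
  end

  def with_retry(fun, attempts) do
    case fun.() do
      {:error, :undef} ->
        # The module may not be loaded yet: wait, then try again.
        Process.sleep(@retry_interval)
        with_retry(fun, attempts + 1)

      other ->
        # Success or any other error is returned unchanged.
        other
    end
  end
end
```

With this sketch, `RetrySketch.with_retry(fn -> {:ok, :started} end)` returns `{:ok, :started}` on the first call, while a function that always returns `{:error, :undef}` ends in `{:error, :too_many_attempts}` after three attempts. Capping the attempts avoids retrying forever when the module is never going to become available.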

slashdotdash commented 6 years ago

Thanks for the bug report and patch @msw10100.

Can you create a pull request for the change and I'll get it merged in?

slashdotdash commented 6 years ago

Fixed by #56.

bitwalker / swarm

Topology change on node startup when process modules not yet loaded #55