bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir
MIT License
1.19k stars 102 forks

Process migration not working on second failover scenarios #93

Closed beardedeagle closed 6 years ago

beardedeagle commented 6 years ago

This issue greatly resembles #89 but isn't quite the same in how it presents.

Timeline:

  1. Bring up the test01 and test02 BEAM VMs (this completes fine)
  2. Create a worker process (this completes fine and the process is assigned to one of the VMs)
  3. Kill the VM the worker process is on, causing the worker process to migrate (this works fine)
  4. Bring back up the VM that was killed (this works fine)
  5. After a second or so, the process is untracked with reason :restart

Logs:

test01:

~> PORT=4001 DB_DIR=priv/test01 iex --name test01@127.0.0.1 --cookie test -S mix phx.server
Erlang/OTP 20 [erts-9.3.3.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

[info] [libcluster:vanguard] connected to :"test01@127.0.0.1"
[info] [libcluster:vanguard] connected to :"test02@127.0.0.1"
[info] Application mnesia exited: :stopped
[info] [swarm on test01@127.0.0.1] [tracker:init] started
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4001
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(test01@127.0.0.1)1> 10:09:47 - info: compiled 6 files into 2 files, copied 3 in 792 ms
[info] [swarm on test01@127.0.0.1] [tracker:cluster_wait] joining cluster..
[info] [swarm on test01@127.0.0.1] [tracker:cluster_wait] found connected nodes: [:"test02@127.0.0.1"]
[info] [swarm on test01@127.0.0.1] [tracker:cluster_wait] selected sync node: test02@127.0.0.1
[info] [swarm on test01@127.0.0.1] [tracker:syncing] received registry from test02@127.0.0.1, merging..
[info] [swarm on test01@127.0.0.1] [tracker:syncing] local synchronization with test02@127.0.0.1 complete!
[info] [swarm on test01@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[info] GET /api/graphiql
[info] Sent 200 in 28ms
[info] POST /api/graphiql
[debug] ABSINTHE schema=VanguardWeb.Schema variables=%{}
---
query IntrospectionQuery {
    __schema {
      queryType { name }
      mutationType { name }
      subscriptionType { name }
      types {
        ...FullType
      }
      directives {
        name
        description
        locations
        args {
          ...InputValue
        }
      }
    }
  }

  fragment FullType on __Type {
    kind
    name
    description
    fields(includeDeprecated: true) {
      name
      description
      args {
        ...InputValue
      }
      type {
        ...TypeRef
      }
      isDeprecated
      deprecationReason
    }
    inputFields {
      ...InputValue
    }
    interfaces {
      ...TypeRef
    }
    enumValues(includeDeprecated: true) {
      name
      description
      isDeprecated
      deprecationReason
    }
    possibleTypes {
      ...TypeRef
    }
  }

  fragment InputValue on __InputValue {
    name
    description
    type { ...TypeRef }
    defaultValue
  }

  fragment TypeRef on __Type {
    kind
    name
    ofType {
      kind
      name
      ofType {
        kind
        name
        ofType {
          kind
          name
          ofType {
            kind
            name
            ofType {
              kind
              name
              ofType {
                kind
                name
                ofType {
                  kind
                  name
                }
              }
            }
          }
        }
      }
    }
  }
---
[info] Sent 200 in 243ms
[info] POST /api/graphiql
[debug] ABSINTHE schema=VanguardWeb.Schema variables=%{}
---
mutation {
  addOwner(name: "Test") {
    createdAt
    name
    ownerId
  }
}
---
[info] Sent 200 in 23ms
[info] POST /api/graphiql
[debug] ABSINTHE schema=VanguardWeb.Schema variables=%{}
---
mutation {
  addCheck (
    checkData: {
      path: "/bin/echo"
      opts: ["arg1"]
      env: "dev"
      passAll: true
    }
    dc: "phx3"
    handlerId: []
    host: "Vanguard"
    interval: 10000
    ownerId: "31d9227d-d64f-4e16-89f6-0145917e3778"
    retention: 10
    timeout: 3000
  ) {
    checkId
  }
}
---
[info] Sent 200 in 20ms
[debug] [swarm on test01@127.0.0.1] [tracker:handle_call] registering "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" as process started by Elixir.Vanguard.Check.Registrar.register/1 with args ["b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b"]
[debug] [swarm on test01@127.0.0.1] [tracker:do_track] starting "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" on remote node test02@127.0.0.1
[debug] [swarm on test01@127.0.0.1] [tracker:handle_replica_event] replicating registration for "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" (#PID<14055.575.0>) locally
[debug] [swarm on test01@127.0.0.1] [tracker:start_pid_remotely] "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" already registered to #PID<14055.575.0> on test02@127.0.0.1, registering locally
iex(test01@127.0.0.1)4> Swarm.registered
[{"b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b", #PID<14055.575.0>}]
iex(test01@127.0.0.1)5>
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
^C
~> PORT=4001 DB_DIR=priv/test01 iex --name test01@127.0.0.1 --cookie test -S mix phx.server
Erlang/OTP 20 [erts-9.3.3.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

[info] [libcluster:vanguard] connected to :"test01@127.0.0.1"
[info] [libcluster:vanguard] connected to :"test02@127.0.0.1"
[info] Application mnesia exited: :stopped
[info] [swarm on test01@127.0.0.1] [tracker:init] started
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4001
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(test01@127.0.0.1)1> 10:12:18 - info: compiled 6 files into 2 files, copied 3 in 790 ms
[info] [swarm on test01@127.0.0.1] [tracker:cluster_wait] joining cluster..
[info] [swarm on test01@127.0.0.1] [tracker:cluster_wait] found connected nodes: [:"test02@127.0.0.1"]
[info] [swarm on test01@127.0.0.1] [tracker:cluster_wait] selected sync node: test02@127.0.0.1
[info] [swarm on test01@127.0.0.1] [tracker:syncing] received registry from test02@127.0.0.1, merging..
[debug] [swarm on test01@127.0.0.1] [tracker:sync_registry] local tracker is missing "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b", adding to registry
[info] [swarm on test01@127.0.0.1] [tracker:syncing] local synchronization with test02@127.0.0.1 complete!
[info] [swarm on test01@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
iex(test01@127.0.0.1)1> Swarm.registered
[{"b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b", #PID<14055.575.0>}]
iex(test01@127.0.0.1)2> [debug] [swarm on test01@127.0.0.1] [tracker:handle_monitor] "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" is down: :restart
[debug] [swarm on test01@127.0.0.1] [tracker:handle_replica_event] replica event: untrack #PID<14055.575.0>
[debug] [swarm on test01@127.0.0.1] [tracker:handle_cast] received sync request from test02@127.0.0.1
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] received sync acknowledgement from test02@127.0.0.1, syncing with remote registry
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] local synchronization with test02@127.0.0.1 complete!
[info] [swarm on test01@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[debug] [swarm on test01@127.0.0.1] [tracker:handle_cast] received sync request from test02@127.0.0.1
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] received sync acknowledgement from test02@127.0.0.1, syncing with remote registry
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] local synchronization with test02@127.0.0.1 complete!
[info] [swarm on test01@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[debug] [swarm on test01@127.0.0.1] [tracker:handle_cast] received sync request from test02@127.0.0.1
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] received sync acknowledgement from test02@127.0.0.1, syncing with remote registry
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] local synchronization with test02@127.0.0.1 complete!
[info] [swarm on test01@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[debug] [swarm on test01@127.0.0.1] [tracker:handle_cast] received sync request from test02@127.0.0.1
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] received sync acknowledgement from test02@127.0.0.1, syncing with remote registry
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] local synchronization with test02@127.0.0.1 complete!
[info] [swarm on test01@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[debug] [swarm on test01@127.0.0.1] [tracker:handle_cast] received sync request from test02@127.0.0.1
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] received sync acknowledgement from test02@127.0.0.1, syncing with remote registry
[info] [swarm on test01@127.0.0.1] [tracker:awaiting_sync_ack] local synchronization with test02@127.0.0.1 complete!
[info] [swarm on test01@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared

test02:

~> PORT=4000 DB_DIR=priv/test02 iex --name test02@127.0.0.1 --cookie test -S mix phx.server
Erlang/OTP 20 [erts-9.3.3.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

[warn] [libcluster:vanguard] unable to connect to :"test01@127.0.0.1"
[info] [libcluster:vanguard] connected to :"test02@127.0.0.1"
[info] [swarm on test02@127.0.0.1] [tracker:init] started
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(test02@127.0.0.1)1> 10:09:38 - info: compiled 6 files into 2 files, copied 3 in 1.1 sec
[info] [swarm on test02@127.0.0.1] [tracker:cluster_wait] joining cluster..
[info] [swarm on test02@127.0.0.1] [tracker:cluster_wait] no connected nodes, proceeding without sync
[debug] [swarm on test02@127.0.0.1] [tracker:ensure_swarm_started_on_remote_node] nodeup for test01@127.0.0.1 was ignored because swarm not started yet, will retry in 1000ms..
[info] [swarm on test02@127.0.0.1] [tracker:cluster_wait] joining cluster..
[info] [swarm on test02@127.0.0.1] [tracker:cluster_wait] no connected nodes, proceeding without sync
[info] [swarm on test02@127.0.0.1] [tracker:ensure_swarm_started_on_remote_node] nodeup test01@127.0.0.1
[debug] [swarm on test02@127.0.0.1] [tracker:handle_topology_change] topology change (nodeup for test01@127.0.0.1)
[info] [swarm on test02@127.0.0.1] [tracker:handle_topology_change] topology change complete
[debug] [swarm on test02@127.0.0.1] [tracker:handle_cast] received sync request from test01@127.0.0.1
[info] [swarm on test02@127.0.0.1] [tracker:awaiting_sync_ack] received sync acknowledgement from test01@127.0.0.1, syncing with remote registry
[info] [swarm on test02@127.0.0.1] [tracker:awaiting_sync_ack] local synchronization with test01@127.0.0.1 complete!
[info] [swarm on test02@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[debug] [swarm on test02@127.0.0.1] [tracker:handle_call] registering "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" as process started by Elixir.Vanguard.Check.Registrar.register/1 with args ["b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b"]
[debug] [swarm on test02@127.0.0.1] [tracker:do_track] starting "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" on test02@127.0.0.1
[debug] [swarm on test02@127.0.0.1] [tracker:do_track] started "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" on test02@127.0.0.1
[debug] [swarm on test02@127.0.0.1] [tracker:handle_call] :"test01@127.0.0.1" is registering "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" as process started by Elixir.Vanguard.Check.Registrar.register/1 with args ["b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b"]
[debug] [swarm on test02@127.0.0.1] [tracker:do_track] found "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" already registered on test02@127.0.0.1
iex(test02@127.0.0.1)4> Swarm.registered
[{"b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b", #PID<0.575.0>}]
iex(test02@127.0.0.1)5> [info] [swarm on test02@127.0.0.1] [tracker:nodedown] nodedown test01@127.0.0.1
[debug] [swarm on test02@127.0.0.1] [tracker:handle_topology_change] topology change (nodedown for test01@127.0.0.1)
[info] [swarm on test02@127.0.0.1] [tracker:handle_topology_change] topology change complete
iex(test02@127.0.0.1)6> Swarm.registered
[{"b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b", #PID<0.575.0>}]
iex(test02@127.0.0.1)7> [debug] [swarm on test02@127.0.0.1] [tracker:ensure_swarm_started_on_remote_node] nodeup for test01@127.0.0.1 was ignored because swarm not started yet, will retry in 1000ms..
[info] [swarm on test02@127.0.0.1] [tracker:cluster_wait] joining cluster..
[info] [swarm on test02@127.0.0.1] [tracker:cluster_wait] no connected nodes, proceeding without sync
[info] [swarm on test02@127.0.0.1] [tracker:ensure_swarm_started_on_remote_node] nodeup test01@127.0.0.1
[debug] [swarm on test02@127.0.0.1] [tracker:handle_topology_change] topology change (nodeup for test01@127.0.0.1)
[info] [swarm on test02@127.0.0.1] [tracker:handle_topology_change] topology change complete
[debug] [swarm on test02@127.0.0.1] [tracker:handle_cast] received sync request from test01@127.0.0.1
[info] [swarm on test02@127.0.0.1] [tracker:awaiting_sync_ack] received sync acknowledgement from test01@127.0.0.1, syncing with remote registry
[info] [swarm on test02@127.0.0.1] [tracker:awaiting_sync_ack] local synchronization with test01@127.0.0.1 complete!
[info] [swarm on test02@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
iex(test02@127.0.0.1)7> Swarm.registered
[{"b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b", #PID<0.575.0>}]
iex(test02@127.0.0.1)8> Swarm.registered
[{"b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b", #PID<0.575.0>}]
iex(test02@127.0.0.1)9> [debug] [swarm on test02@127.0.0.1] [tracker:handle_monitor] "b3f7d8d2-d37b-48eb-b1ef-853aaa49ed4b" is down: :restart
[debug] [swarm on test02@127.0.0.1] [tracker:handle_replica_event] replica event: untrack #PID<0.575.0>
[info] [swarm on test02@127.0.0.1] [tracker:anti_entropy] syncing with test01@127.0.0.1
[info] [swarm on test02@127.0.0.1] [tracker:syncing] received registry from test01@127.0.0.1, merging..
[info] [swarm on test02@127.0.0.1] [tracker:syncing] local synchronization with test01@127.0.0.1 complete!
[info] [swarm on test02@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[info] [swarm on test02@127.0.0.1] [tracker:anti_entropy] syncing with test01@127.0.0.1
[info] [swarm on test02@127.0.0.1] [tracker:syncing] received registry from test01@127.0.0.1, merging..
[info] [swarm on test02@127.0.0.1] [tracker:syncing] local synchronization with test01@127.0.0.1 complete!
[info] [swarm on test02@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[info] [swarm on test02@127.0.0.1] [tracker:anti_entropy] syncing with test01@127.0.0.1
[info] [swarm on test02@127.0.0.1] [tracker:syncing] received registry from test01@127.0.0.1, merging..
[info] [swarm on test02@127.0.0.1] [tracker:syncing] local synchronization with test01@127.0.0.1 complete!
[info] [swarm on test02@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[info] [swarm on test02@127.0.0.1] [tracker:anti_entropy] syncing with test01@127.0.0.1
[info] [swarm on test02@127.0.0.1] [tracker:syncing] received registry from test01@127.0.0.1, merging..
[info] [swarm on test02@127.0.0.1] [tracker:syncing] local synchronization with test01@127.0.0.1 complete!
[info] [swarm on test02@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[info] [swarm on test02@127.0.0.1] [tracker:anti_entropy] syncing with test01@127.0.0.1
[info] [swarm on test02@127.0.0.1] [tracker:syncing] received registry from test01@127.0.0.1, merging..
[info] [swarm on test02@127.0.0.1] [tracker:syncing] local synchronization with test01@127.0.0.1 complete!
[info] [swarm on test02@127.0.0.1] [tracker:resolve_pending_sync_requests] pending sync requests cleared
[info] [swarm on test02@127.0.0.1] [tracker:nodedown] nodedown test01@127.0.0.1
[debug] [swarm on test02@127.0.0.1] [tracker:handle_topology_change] topology change (nodedown for test01@127.0.0.1)
[info] [swarm on test02@127.0.0.1] [tracker:handle_topology_change] topology change complete

If need be I can push my code somewhere for testing.

ourway commented 6 years ago

Any updates?

beardedeagle commented 6 years ago

It turns out that, for me, this was an issue with my startup module, which is meant to catch any orphaned processes on startup. Fixed code below: instead of blindly sending every record to my initialize module, where I was issuing a Process.exit(pid, :restart), I now first check that the process isn't already running in Swarm:

defmodule Vanguard.Check.Startup do
  @moduledoc """
  `Vanguard.Check.Startup` is a task that starts checks which may not be running on startup.
  """
  use Task

  alias Vanguard.Db.CheckStore

  def start_link(_args) do
    Task.start_link(__MODULE__, :run, [])
  end

  def run do
    # Wait for the Mnesia table to be available before reading checks.
    :mnesia.wait_for_tables([CheckStore], 15_000)

    {:ok, records} = CheckStore.read_check()

    Enum.each(records, fn record ->
      # Only initialize a check if Swarm doesn't already track it. Previously
      # every record was re-initialized unconditionally, which killed the
      # already-migrated process with reason :restart.
      case Swarm.whereis_name(Map.get(record, :check_id)) do
        :undefined ->
          GenServer.call(
            Vanguard.Check.Initialize,
            {:initialize, Map.get(record, :check_id), Map.get(record, :check_state)}
          )

        _ ->
          :ok
      end
    end)
  end
end

I don't know if this gives you any hints, @ourway.
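For context, here's a minimal sketch of how a task like this gets wired into an application's supervision tree. The `Vanguard.Application` module, the supervisor name, and the sibling children are assumptions for illustration, not taken from the actual project:

```elixir
# Hypothetical application module -- only Vanguard.Check.Startup matches the
# snippet above; everything else here is an assumption for illustration.
defmodule Vanguard.Application do
  use Application

  def start(_type, _args) do
    children = [
      # ...endpoint, Swarm registration helpers, etc. would go here...

      # Run the orphan-recovery task once at boot. Because the module
      # uses `use Task`, its child_spec defaults to restart: :temporary,
      # so it runs once and is not restarted when it finishes.
      Vanguard.Check.Startup
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: Vanguard.Supervisor)
  end
end
```

The important ordering detail is that the startup task comes after Swarm is running, so `Swarm.whereis_name/1` can actually answer for processes that survived on other nodes.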

bitwalker commented 6 years ago

@beardedeagle Is this resolved or do you believe there is still an issue here?

beardedeagle commented 6 years ago

The issue is solved. I left it open to potentially help others. Closing.