bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir
MIT License
1.2k stars 103 forks source link

Unable to resolve conflicting registry when PID is remote #90

Open uberbrodt opened 6 years ago

uberbrodt commented 6 years ago

I am seeing an issue that crops up when we deploy our code, where conflicting registries between two nodes are never resolved. I'm using the StaticQuorumRing strategy, and our deployments drain a node, bring new node back up, repeat. We're using Swarm to make sure one instance of a process is running in the cluster, so our registries look like this:

iex(profile-service-28439@api-cluster-074.load.awse.node.awse.consul)8> Swarm.registered()
[
  {{Commanded.Event.Handler, "profile_kv_repo"}, #PID<0.3132.0>},
  {:profile_event_publisher, #PID<38995.2867.0>},
  {{Commanded.Event.Handler, "profile_assessments"}, #PID<38994.2987.0>},
  {{Commanded.Event.Handler, "event_publication_handler"}, #PID<0.3133.0>},
  {{Commanded.Event.Handler, "repository_handler"}, #PID<38994.2986.0>}
]

Sometimes, two nodes seem to get in an infinite loop of trying to stop and restart one of those processes.

Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.673 [debug] | [profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] Attempting to start distributed process: {Commanded.Event.Handler, "profile_kv_repo"} (Commanded.Event.Handler with args %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}) 
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.674 [debug] | [swarm on profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] [tracker:handle_call] registering {Commanded.Event.Handler, "profile_kv_repo"} as process started by Elixir.GenServer.start_link/2 with args [Commanded.Event.Handler, %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}] 
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.674 [debug] | [swarm on profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] [tracker:do_track] starting {Commanded.Event.Handler, "profile_kv_repo"} on remote node profile-service-28439@api-cluster-074.load.awse.node.awse.consul 
Jun 21 08:17:08 api-cluster-074.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[15041]: 2018-06-21 13:17:07.675 [debug] | [swarm on profile-service-28439@api-cluster-074.load.awse.node.awse.consul] [tracker:handle_call] :"profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul" is registering {Commanded.Event.Handler, "profile_kv_repo"} as process started by Elixir.GenServer.start_link/2 with args [Commanded.Event.Handler, %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}] 
Jun 21 08:17:08 api-cluster-074.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[15041]: 2018-06-21 13:17:07.675 [debug] | [swarm on profile-service-28439@api-cluster-074.load.awse.node.awse.consul] [tracker:do_track] found {Commanded.Event.Handler, "profile_kv_repo"} already registered on profile-service-28439@api-cluster-074.load.awse.node.awse.consul 
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.679 [debug] | [swarm on profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] [tracker:start_pid_remotely] {Commanded.Event.Handler, "profile_kv_repo"} already registered to #PID<38830.3839.0> on profile-service-28439@api-cluster-074.load.awse.node.awse.consul, registering locally 
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.679 [debug] | [swarm on profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] [tracker:register] conflicting registration for {Commanded.Event.Handler, "profile_kv_repo"}: remote (#PID<38830.3839.0>) vs. local #PID<38831.3124.0> 
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.679 [debug] | [:"profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul"] Named process {Commanded.Event.Handler, "profile_kv_repo"} already started on :"profile-service-20722@api-cluster-095.load.awse.node.awse.consul" (#PID<38831.3124.0>) 
Jun 21 08:17:08 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:07.684 [debug] | [profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] Named process {Commanded.Event.Handler, "profile_kv_repo"} down due to: :noconnection 
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.337 [debug] | [swarm on profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] [tracker:handle_call] untrack #PID<38831.3124.0> 
Jun 21 08:17:09 api-cluster-074.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[15041]: 2018-06-21 13:17:08.337 [debug] | [swarm on profile-service-28439@api-cluster-074.load.awse.node.awse.consul] [tracker:handle_replica_event] replica event: untrack #PID<38829.3124.0> 
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.341 [debug] | [swarm on profile-service-26153@api-cluster-095.load.awse.node.awse.consul] [tracker:handle_replica_event] replica event: untrack #PID<38992.3124.0> 
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.423 [info] request_id=2kt2uj2q7l6e7oovss000dj1 | GET /monitor/ping 
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.423 [debug] request_id=2kt2uj2q7l6e7oovss000dj1 | Processing with Profile.Web.UtilityController.monitor/2   Parameters: %{}   Pipelines: [] 
Jun 21 08:17:09 api-cluster-095.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[5860]: 2018-06-21 13:17:08.423 [info] request_id=2kt2uj2q7l6e7oovss000dj1 | Sent 200 in 225µs 
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.685 [debug] | [profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] Attempting to start distributed process: {Commanded.Event.Handler, "profile_kv_repo"} (Commanded.Event.Handler with args %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}) 
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.685 [debug] | [swarm on profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] [tracker:handle_call] registering {Commanded.Event.Handler, "profile_kv_repo"} as process started by Elixir.GenServer.start_link/2 with args [Commanded.Event.Handler, %Commanded.Event.Handler{consistency: :strong, handler_module: Profiles.EventHandler.ProfileKVRepoHandler, handler_name: "profile_kv_repo", last_seen_event: nil, subscribe_from: :current, subscription: nil}] 
Jun 21 08:17:09 api-cluster-0bb.load.awse ProfileServiceApp: b8dd94f639c0a14a6e53b9ef2fede43fba04a9a8[18963]: 2018-06-21 13:17:08.685 [debug] | [swarm on profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul] [tracker:do_track] starting {Commanded.Event.Handler, "profile_kv_repo"} on remote node profile-service-28439@api-cluster-074.load.awse.node.awse.consul 

And the above goes on and on...

Remote shelling in, I was able to see that both nodes are communicating, but one node has a registry like this:

iex(profile-service-30987@api-cluster-0bb.load.awse.node.awse.consul)14> Swarm.registered()
[
  {{Commanded.Event.Handler, "profile_kv_repo"}, #PID<38831.3124.0>},
  {:profile_event_publisher, #PID<38994.2867.0>},
  {{Commanded.Event.Handler, "profile_assessments"}, #PID<0.2987.0>},
  {{Commanded.Event.Handler, "event_publication_handler"}, #PID<38830.3133.0>},
  {{Commanded.Event.Handler, "repository_handler"}, #PID<0.2986.0>}
]

So the issue here, is that the remote process indicated for profile_kv_repo didn't exist. But the code that handles these sorts of conflicts here does nothing if the conflicting node is remote, which I believe sets it up for a loop of trying to start that particular process again. Doing Swarm.unregister_name({Commanded.Event.Handler, "profile_kv_repo"}) on the node with the bad entry resolved the issue.

So a couple of questions:

  1. How did it get in this state? Is there a deployment process we should follow to fix this?
  2. Is there something that Swarm can do to resolve this issue? Maybe when Swarm detects a process is down, it can remove it's registration for that pid and then attempt to restart?