bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir

Function clause error in `Swarm.IntervalTreeClock.fill` on topology change #126

Open devonestes opened 5 years ago

devonestes commented 5 years ago

Howdy!

Our error tracker picked up the following error that happened when we deployed our application yesterday.

We're using Swarm (version 3.4.0) to manage a two node cluster on Kubernetes. When the first node was shut down as part of the rolling deploy, the following ErlangError was raised (I've formatted it a bit to make it easier to read):

{{:function_clause,
  [
    {Swarm.IntervalTreeClock, :fill, [{1, 0}, {1, 81714}],
     [file: 'lib/swarm/tracker/crdt.ex', line: 170]},
    {Swarm.IntervalTreeClock, :event, 1, [file: 'lib/swarm/tracker/crdt.ex', line: 52]},
    {Swarm.Tracker, :remove_registration, 2, [file: 'lib/swarm/tracker/tracker.ex', line: 1494]},
    {Swarm.Tracker, :"-handle_topology_change/2-fun-0-", 3,
     [file: 'lib/swarm/tracker/tracker.ex', line: 911]},
    {:lists, :foldl, 3, [file: 'lists.erl', line: 1263]},
    {:ets, :do_foldl, 4, [file: 'ets.erl', line: 638]},
    {:ets, :foldl, 3, [file: 'ets.erl', line: 627]},
    {Swarm.Tracker, :handle_topology_change, 2, [file: 'lib/swarm/tracker/tracker.ex', line: 840]}
  ]},
 {:gen_statem, :call,
  [
    Swarm.Tracker,
    {:track, #Reference<0.2697759696.1380188162.160667>,
     %{
       mfa:
         {GenServer, :start_link,
          [
            PotionWeb.Devices.SendAfter,
            {"device::id",
             {PotionWeb.Devices.Worker, :device_offline, []}, 30000}
          ]}
     }},
    :infinity
  ]}}

It looks like a function clause error where Swarm.IntervalTreeClock.fill({1, 0}, {1, 81714}) was called, but none of the function clauses match:

  defp fill(0, e), do: e
  defp fill(1, {_, _, _} = e), do: height(e)
  defp fill(_, n) when is_integer(n), do: n

  defp fill({1, r}, {n, el, er}) do
    er1 = fill(r, er)
    d = max(height(el), base(er1))
    norm_ev({n, d, er1})
  end

  defp fill({l, 1}, {n, el, er}) do
    el1 = fill(l, el)
    d = max(height(er), base(el1))
    norm_ev({n, el1, d})
  end

  defp fill({l, r}, {n, el, er}) do
    norm_ev({n, fill(l, el), fill(r, er)})
  end
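
To make the mismatch concrete, here is a minimal standalone sketch (module name and stubbed bodies are mine, not Swarm's code): only the clause heads matter, so the three tuple/tuple clauses are collapsed into one. The second argument {1, 81714} is a 2-tuple, so it satisfies neither the is_integer/1 guard nor the three-element tuple patterns.

  defmodule FillRepro do
    # Trimmed copies of the clause heads of Swarm.IntervalTreeClock.fill/2.
    # Bodies are stubbed out; only the pattern matching matters here, and the
    # three tuple/tuple clauses are collapsed into one since their heads all
    # require a three-element tuple as the second argument.
    def fill(0, e), do: e
    def fill(1, {_, _, _} = e), do: e
    def fill(_, n) when is_integer(n), do: n
    def fill({_l, _r}, {_n, _el, _er} = e), do: e
  end

  # The arguments from the stack trace: {1, 0} is a valid id side, but the
  # event side {1, 81714} is a 2-tuple, so no clause matches and this raises
  # the same FunctionClauseError as above.
  FillRepro.fill({1, 0}, {1, 81714})
  # ** (FunctionClauseError) no function clause matching in FillRepro.fill/2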

This is pretty deep in the CRDT weeds so I don't think I'm going to be able to help much in figuring out how that function got called with those arguments, but I'd be happy to help with any other information about the application that can help in fixing this.

spunkedy commented 5 years ago

I have this issue as well, but on 3.3.1. Hope this extra info helps whoever picks this up.

Process #PID<0.9499.2> terminating:

{{:function_clause,
  [
    {Swarm.IntervalTreeClock, :fill, [{0, 18}, {1, 18}],
     [file: 'lib/swarm/tracker/crdt.ex', line: 164]},
    {Swarm.IntervalTreeClock, :event, 1, [file: 'lib/swarm/tracker/crdt.ex', line: 53]},
    {Swarm.Tracker, :"-sync_registry/4-fun-0-", 4, [file: 'lib/swarm/tracker/tracker.ex', line: 371]},
    {Enum, :"-reduce/3-lists^foldl/2-0-", 3, [file: 'lib/enum.ex', line: 1925]},
    {Swarm.Tracker, :awaiting_sync_ack, 3, [file: 'lib/swarm/tracker/tracker.ex', line: 488]},
    {:gen_statem, :call_state_function, 5, [file: 'gen_statem.erl', line: 1660]},
    {:gen_statem, :loop_event_state_function, 6, [file: 'gen_statem.erl', line: 1023]},
    {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}
  ]},
 {:gen_statem, :call,
  [
    Swarm.Tracker,
    {:track, Backend.Scheduler.ExecutionBroadcaster,
     %{
       mfa:
         {Quantum.ExecutionBroadcaster, :start_link,
          [
            %Quantum.ExecutionBroadcaster.StartOpts{
              debug_logging: true,
              job_broadcaster_reference: {:via, :swarm, Backend.Scheduler.JobBroadcaster},
              name: Backend.Scheduler.ExecutionBroadcaster,
              scheduler: Backend.Scheduler,
              storage: Quantum.Storage.Noop
            }
          ]}
     }},
    :infinity
  ]}}

My configuration

  backend: [
    strategy: Cluster.Strategy.Kubernetes.DNS,
    config: [
      service: System.get_env("K8S_CLUSTER_SERVICE"),
      application_name: System.get_env("K8S_CLUSTER_APP_NAME"),
      polling_interval: 10_000
    ]
  ]
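
For anyone comparing setups: that fragment is a libcluster strategy block, and a standard libcluster topologies entry using the same strategy looks roughly like the sketch below. This is only a sketch; the k8s topology name and the config file placement are illustrative, not necessarily how this app wires it up.

  import Config

  # Illustrative libcluster config; the :k8s topology name is a placeholder.
  config :libcluster,
    topologies: [
      k8s: [
        strategy: Cluster.Strategy.Kubernetes.DNS,
        config: [
          service: System.get_env("K8S_CLUSTER_SERVICE"),
          application_name: System.get_env("K8S_CLUSTER_APP_NAME"),
          polling_interval: 10_000
        ]
      ]
    ]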
markmeeus commented 5 years ago

We are also having this issue on 3.3.1, running 2 nodes on OpenShift (Kubernetes) with the libcluster Kubernetes.DNS (headless service) strategy. In our case, the Swarm.IntervalTreeClock.fill error didn't go away until we restarted the services.

Probably related is that we experienced the following crash right before the Swarm.IntervalTreeClock.fill error.

{{{:badmatch, false},
  [
    {Swarm.Registry, :new!, 1, [file: 'lib/swarm/registry.ex', line: 99]},
    {Swarm.Tracker, :handle_replica_event, 4, [file: 'lib/swarm/tracker/tracker.ex', line: 814]},
    {:gen_statem, :call_state_function, 5, [file: 'gen_statem.erl', line: 1240]},
    {:gen_statem, :loop_event, 6, [file: 'gen_statem.erl', line: 1012]},
    {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}
  ]},
 {:gen_statem, :call,
  [
    Swarm.Tracker,
    {:track, "<our_id>",
     %{mfa: {Cadex.Swarm.Supervisor, :start_event_processor, ["PH06407805_2019"]}}},
    :infinity
  ]}}

Seems like the Registry tries to insert a record into ETS with a duplicate key?
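
A {:badmatch, false} like that is what you get when a `true = ...` match is made on the result of :ets.insert_new/2, which returns false if the key is already present. A minimal illustration (the table name and tuples here are made up, not Swarm's actual registry code):

  # Illustration only: :ets.insert_new/2 returns false when the key already
  # exists, so matching `true = ...` on a duplicate insert crashes with
  # {:badmatch, false}, just like the trace above.
  table = :ets.new(:demo_registry, [:set, :named_table])
  true = :ets.insert_new(table, {"<our_id>", self()})
  true = :ets.insert_new(table, {"<our_id>", self()})
  # ** (MatchError) no match of right hand side value: false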

Some of our processes exit immediately after being started because they determine there isn't anything left to do. But in some edge cases, new data may come in, and the process is recreated. I did notice that this specific process seems to have stopped immediately after being started. Maybe it didn't get cleaned up on all nodes before it got recreated again? Also, a network split seems to have occurred, because I found this in the logs at the time the error started:

[swarm on cadex_service@server2] [tracker:handle_cast] received sync request from cadex_service@server1
[swarm on cadex_service@server1] [tracker:cluster_wait] selected sync node: cadex_service@server2
[swarm on cadex_service@server1] [tracker:cluster_wait] found connected nodes: [:"cadex_service@server2"]
[swarm on cadex_service@server1] [tracker:cluster_wait] joining cluster..
markmeeus commented 5 years ago

We have upgraded to 3.4, and now it seems to derail even more quickly. These issues occur on 2 fixed nodes (not K8s).

Besides the IntervalTreeClock errors (343,584 log entries in 6 hours), we are also getting lots of these:

mismatched pids, local clock conflicts with remote clock, event unhandled

and these:

received track event for "<some-id>", but local clock conflicts with remote clock, event unhandled

and these:

failed to start "some-id" on cadex_service@13.41.2.18: nodedown, retrying operation..

And also, some tracked processes are actually gone:

(EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

I'll try to get the states from the system.

I am also wondering what would happen if a node restarts and starts creating processes before it has joined the cluster. When it rejoins, I can imagine the other nodes not being aware that it restarted. Could there be some sort of race condition between rejoining after a restart and registering new processes that may have been tracked before?
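
If that race exists, one crude mitigation we could try is to hold off on registering processes until the node sees at least one peer. A rough sketch of that idea (the module name, retry counts, and the whole approach are just an assumption on my part, not something from the Swarm docs):

  defmodule MyApp.ClusterWait do
    # Hypothetical helper: block until this node is connected to at least one
    # other node (or give up after `retries` attempts), before we start
    # registering processes via Swarm.
    def await_peers(retries \\ 50, interval_ms \\ 200)

    def await_peers(0, _interval_ms), do: {:error, :no_peers}

    def await_peers(retries, interval_ms) do
      case Node.list() do
        [] ->
          Process.sleep(interval_ms)
          await_peers(retries - 1, interval_ms)

        _peers ->
          :ok
      end
    end
  end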

markmeeus commented 5 years ago

Sorry for diluting this ticket. It looks like the IntervalTreeClock errors on fill and sum are gone after upgrading to 3.4. We did some refactoring in our code so that we don't join groups when a process has already joined the group (roughly as in the sketch below), and we fixed some processes that were crashing in the :die event. That seems to solve most of our issues.
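
For what it's worth, the guard we added is roughly along these lines (a simplified sketch using Swarm.members/1 and Swarm.join/2, not our exact production code):

  defmodule MyApp.Groups do
    # Only join the Swarm group if this pid is not already a member, to avoid
    # duplicate joins when a process is restarted or re-registered.
    def join_once(group, pid \\ self()) do
      if pid in Swarm.members(group) do
        :already_member
      else
        Swarm.join(group, pid)
      end
    end
  end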

When we have bursts of data, we still get these warnings:

received track event for "<some-id>", but local clock conflicts with remote clock, event unhandled

Not sure how to interpret these, since everything seems to work out fine.