distribworks / dkron

Dkron - Distributed, fault tolerant job scheduling system https://dkron.io
GNU Lesser General Public License v3.0
4.28k stars 379 forks source link

Failover triggers on member join event #683

Open andreygolev opened 4 years ago

andreygolev commented 4 years ago

Describe the bug Hi, I'm trying to make Dkron work in Kubernetes environments, but so far not so much luck there. Currently I'm trying to run Dkron server part with "Deployment" kind in Kubernetes which means that hostname and IP address of container will be random for each container in this deployment. Because of these specifics, I'm running Dkron, 3 instances, with auto-join with the following config:

"dkron agent --server --retry-join="provider=k8s label_selector="app=dkron" namespace="dkron" --bootstrap-expect=3"

Also, no data persistence is enabled in containers. So, each new container joins with empty data dir and new hostname and IP. First, initial start of 3 instances and bootstrapping works just fine.

Bad things start to happen when I delete or add more Dkron servers to cluster.

This is a log of a new container that's is joining to an existing cluster: dkron-server-794cd48678-4drnv

time="2020-02-17T10:14:04Z" level=info msg="No valid config found: Applying default values." error="Config File \"dkron\" Not Found in \"[/etc/dkron /root/.dkron /opt/dkron/bin/config]\""
time="2020-02-17T10:14:04Z" level=info msg="agent: Dkron agent starting" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:04Z" level=info msg="agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer triton vsphere" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:04Z" level=info msg="agent: Joining cluster..." cluster=LAN node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 discover.go:176: [DEBUG] discover: Using provider \"k8s\""
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 k8s_discover.go:138: [DEBUG] discover-k8s: ignoring pod \"dkron-server-794cd48678-4drnv\", not running: \"Pending\""
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 k8s_discover.go:138: [DEBUG] discover-k8s: ignoring pod \"dkron-server-794cd48678-lbbmg\", not running: \"Pending\""
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 k8s_discover.go:138: [DEBUG] discover-k8s: ignoring pod \"dkron-server-794cd48678-mgww8\", not running: \"Pending\""
time="2020-02-17T10:14:04Z" level=info msg="agent: Discovered LAN servers: " node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:04Z" level=warning msg="agent: Join LAN failed: No servers to join, retrying in 30s" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="2020/02/17 10:14:34 discover.go:176: [DEBUG] discover: Using provider \"k8s\""
time="2020-02-17T10:14:34Z" level=info msg="agent: Discovered LAN servers: 172.25.8.124 172.25.7.82 172.25.5.13" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="agent: Join LAN completed. Synced with 3 initial agents" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="All 0 tables opened in 0s\n" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="DropAll called. Blocking writes..." node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="Writes flushed. Stopping compactions now..." node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="Deleted 0 SSTables. Now deleting value logs...\n" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="Value logs deleted. Creating value log file: 0" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="Deleted 1 value log files. DropAll done.\n" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="Resuming writes" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="api: Running HTTP server" address=":8080" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:34Z" level=info msg="dkron: monitoring leadership" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:35Z" level=info msg="agent: Listen for events" node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:35Z" level=info msg="agent: Received event" event=member-join node=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:35Z" level=info msg="adding server" node=dkron-server-794cd48678-4drnv server=dkron-server-794cd48678-4drnv
time="2020-02-17T10:14:35Z" level=info msg="agent: found expected number of peers, attempting to bootstrap cluster...peers172.25.8.124:6868,172.25.5.13:6868,172.25.7.82:6868" node=dkron-server-794cd48678-4drnv

dkron-server-794cd48678-lbbmg

time="2020-02-17T10:14:04Z" level=info msg="No valid config found: Applying default values." error="Config File \"dkron\" Not Found in \"[/etc/dkron /root/.dkron /opt/dkron/bin/config]\""
time="2020-02-17T10:14:04Z" level=info msg="agent: Dkron agent starting" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:04Z" level=info msg="agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer triton vsphere" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:04Z" level=info msg="agent: Joining cluster..." cluster=LAN node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 discover.go:176: [DEBUG] discover: Using provider \"k8s\""
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 k8s_discover.go:138: [DEBUG] discover-k8s: ignoring pod \"dkron-server-794cd48678-4drnv\", not running: \"Pending\""
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 k8s_discover.go:138: [DEBUG] discover-k8s: ignoring pod \"dkron-server-794cd48678-lbbmg\", not running: \"Pending\""
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 k8s_discover.go:138: [DEBUG] discover-k8s: ignoring pod \"dkron-server-794cd48678-mgww8\", not running: \"Pending\""
time="2020-02-17T10:14:04Z" level=info msg="agent: Discovered LAN servers: " node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:04Z" level=warning msg="agent: Join LAN failed: No servers to join, retrying in 30s" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="2020/02/17 10:14:34 discover.go:176: [DEBUG] discover: Using provider \"k8s\""
time="2020-02-17T10:14:34Z" level=info msg="agent: Discovered LAN servers: 172.25.8.124 172.25.7.82 172.25.5.13" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="agent: Join LAN completed. Synced with 3 initial agents" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="All 0 tables opened in 0s\n" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="DropAll called. Blocking writes..." node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="Writes flushed. Stopping compactions now..." node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="Deleted 0 SSTables. Now deleting value logs...\n" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="Value logs deleted. Creating value log file: 0" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="Deleted 1 value log files. DropAll done.\n" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="Resuming writes" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="api: Running HTTP server" address=":8080" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="dkron: monitoring leadership" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="agent: Listen for events" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="agent: Received event" event=member-join node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="adding server" node=dkron-server-794cd48678-lbbmg server=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:34Z" level=info msg="agent: found expected number of peers, attempting to bootstrap cluster...peers172.25.7.82:6868,172.25.8.124:6868,172.25.5.13:6868" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:35Z" level=info msg="dkron: cluster leadership acquired" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:35Z" level=info msg="dkron: monitoring leadership" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:35Z" level=info msg="agent: Starting scheduler" node=dkron-server-794cd48678-lbbmg
time="2020-02-17T10:14:35Z" level=info msg="agent: Received event" event=member-update node=dkron-server-794cd48678-lbbmg

dkron-server-794cd48678-mgww8 - this is a Leader

time="2020-02-17T10:14:03Z" level=info msg="No valid config found: Applying default values." error="Config File \"dkron\" Not Found in \"[/etc/dkron /root/.dkron /opt/dkron/bin/config]\""
time="2020-02-17T10:14:04Z" level=info msg="agent: Dkron agent starting" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:04Z" level=info msg="agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer triton vsphere" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:04Z" level=info msg="agent: Joining cluster..." cluster=LAN node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 discover.go:176: [DEBUG] discover: Using provider \"k8s\""
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 k8s_discover.go:138: [DEBUG] discover-k8s: ignoring pod \"dkron-server-794cd48678-4drnv\", not running: \"Pending\""
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 k8s_discover.go:138: [DEBUG] discover-k8s: ignoring pod \"dkron-server-794cd48678-lbbmg\", not running: \"Pending\""
time="2020-02-17T10:14:04Z" level=info msg="2020/02/17 10:14:04 k8s_discover.go:138: [DEBUG] discover-k8s: ignoring pod \"dkron-server-794cd48678-mgww8\", not running: \"Pending\""
time="2020-02-17T10:14:04Z" level=info msg="agent: Discovered LAN servers: " node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:04Z" level=warning msg="agent: Join LAN failed: No servers to join, retrying in 30s" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="2020/02/17 10:14:34 discover.go:176: [DEBUG] discover: Using provider \"k8s\""
time="2020-02-17T10:14:34Z" level=info msg="agent: Discovered LAN servers: 172.25.8.124 172.25.7.82 172.25.5.13" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="agent: Join LAN completed. Synced with 3 initial agents" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="All 0 tables opened in 0s\n" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="DropAll called. Blocking writes..." node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="Writes flushed. Stopping compactions now..." node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="Deleted 0 SSTables. Now deleting value logs...\n" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="Value logs deleted. Creating value log file: 0" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="Deleted 1 value log files. DropAll done.\n" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="Resuming writes" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="api: Running HTTP server" address=":8080" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="dkron: monitoring leadership" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="agent: Listen for events" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="agent: Received event" event=member-join node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:34Z" level=info msg="adding server" node=dkron-server-794cd48678-mgww8 server=dkron-server-794cd48678-mgww8
time="2020-02-17T10:14:35Z" level=info msg="agent: Received event" event=member-update node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:16:42Z" level=info msg="agent: Received event" event=member-update node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:17:38Z" level=info msg="dkron: cluster leadership acquired" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:17:38Z" level=info msg="dkron: monitoring leadership" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:17:38Z" level=info msg="agent: Starting scheduler" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:18:58Z" level=info msg="agent: Received event" event=member-update node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:19:38Z" level=error msg="dkron: failed to add raft peer" error="leadership lost while committing log" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:19:38Z" level=error msg="failed to reconcile member" error="leadership lost while committing log" member="{dkron-server-794cd48678-rjc7f 172.25.2.183 8946 map[dc:dc1 expect:3 port:6868 region:global role:dkron rpc_addr:172.25.2.183:6868 server:true version:devel] alive 1 5 2 2 5 4}" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:19:38Z" level=error msg="dkron: failed to reconcile" error="leadership lost while committing log" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:19:38Z" level=info msg="dkron: cluster leadership lost" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:19:38Z" level=info msg="dkron: monitoring leadership" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:19:39Z" level=info msg="dkron: cluster leadership acquired" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:19:39Z" level=info msg="dkron: monitoring leadership" node=dkron-server-794cd48678-mgww8
time="2020-02-17T10:19:39Z" level=info msg="agent: Starting scheduler" node=dkron-server-794cd48678-mgww8

As we can see, there are errors on leader and message that cluster leadership is lot and then immediately acquired back. Also I noticed, that when new node joins to cluster, it's not visible in "raft list-peers". Though it becomes visible after this failover. Simple "worker" agents are joining to cluster without any issues.

Expected behavior New server containers are joining to Dkron cluster without triggering failover and re-electing leader. In Kubernetes when a containers fails by any reason, it's restarted automatically by Kubernetes.

Specifications:

andreygolev commented 4 years ago

Seems that it's related to the issue which Consul had some time before. This PR should fixed the issue for Consul: https://github.com/hashicorp/consul/pull/2319/files#diff-316ef07c64533d6ab7ec6f83189a29e3

But looks like it's not that straightforward to merge into Dkron :)

andreygolev commented 4 years ago

I've created a workaround this by creating a bash script that checks if there is a cluster created already. If cluster is already there, then run agent without --bootstrap-expect option and it launches just fine.