distribworks / dkron

Dkron - Distributed, fault tolerant job scheduling system https://dkron.io

RETRY_JOIN fails after server comes back up - it's always DNS! #1253

Open fopina opened 1 year ago

fopina commented 1 year ago

Describe the bug

After both the server and the agents are up and the cluster is running smoothly, if the server goes down and comes back up with a different IP (but the same hostname), the agents do not reconnect.

To Reproduce

x-common: &base
  image: dkron/dkron:3.2.1
  command: agent

networks:
  vpcbr:
    ipam:
      config:

services:
  server:
    <<: *base
    environment:
      DKRON_DATA_DIR: /ext/data
      DKRON_SERVER: 1
      DKRON_NODE_NAME: dkron1
      DKRON_BOOTSTRAP_EXPECT: 1
    ports:

Expected behavior

Agents would eventually retry joining using the hostname, picking up the new IP.

Additional context

I understand serf or raft might be tricky with DNS, but in this case the server does start up with proper access to its data/log, with no corruption. And if I restart the agents, they reconnect just fine. It seems the retry just keeps using the IP from the first join instead of re-resolving the hostname.

To reproduce the issue I'm forcing the IP change here, but when running in Docker Swarm (and I assume in k8s as well), a new IP upon service re-creation is expected without using fixed IPs.
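
For illustration only (this is not necessarily how the linked repro does it): one way to force the change is to pin the server to a static address on the compose network and then recreate the service with a different address while keeping the same hostname.

services:
  server:
    networks:
      vpcbr:
        # change to e.g. 10.5.0.21 and re-run `docker compose up -d server`
        # to simulate the server returning with a new IP but the same hostname
        ipv4_address: 10.5.0.20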

Is this something easy to fix?

fopina commented 1 year ago

Details pushed to https://github.com/fopina/delme/tree/main/dkron_retry_join_does_not_reresolve_dns to make reproducing easier

I'd assume this won't happen with an HA / multi-server setup, as the server that changed IP will retry-join the other servers by itself (and the new IP will then be shared with everyone), but I haven't tested that, and I think this still makes it a valid bug, since a single-server setup is documented.

fopina commented 1 year ago

After taking a look at dkron/retry_join.go, I think the agents are not stuck in a retry loop with an outdated IP: they're not retrying at all. So that is not the right place to fix it...

But looking at agent -h, this is an interesting option:

      --serf-reconnect-timeout string   This is the amount of time to attempt to reconnect to a failed node before
                                        giving up and considering it completely gone. In Kubernetes, you might need
                                        this to about 5s, because there is no reason to try reconnects for default
                                        24h value. Also Raft behaves oddly if node is not reaped and returned with
                                        same ID, but different IP.
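
For reference, in the repro compose file above this can be set by appending the flag to the agent command; the value below is just an example, and I'd assume DKRON_SERF_RECONNECT_TIMEOUT is the environment-variable equivalent if dkron's usual DKRON_ mapping applies.

x-common: &base
  image: dkron/dkron:3.2.1
  # the flag comes straight from `agent -h`; 30s is an arbitrary example value
  command: agent --serf-reconnect-timeout=30s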

But now the agents will "see" the server, yet they still do not retry the join:

agents_3  | time="2023-01-23T00:22:28Z" level=info msg="removing server dkron1 (Addr: 10.5.0.20:6868) (DC: dc1)" node=e9eca2dc43f1
...
agents_3  | time="2023-01-23T00:22:44Z" level=info msg="agent: Received event" event=member-update node=e9eca2dc43f1
agents_3  | time="2023-01-23T00:22:44Z" level=info msg="agent: Received event" event=member-reap node=e9eca2dc43f1

fopina commented 1 year ago

@yvanoers just in case you're still around, would you have any comment on this one? I've tried debugging, but I believe the part that handles reconnection (and does not re-resolve DNS) is within the serf library, not dkron.

I couldn't find any workaround at all... I've tried setting --serf-reconnect-timeout to a low value (as recommended for Kubernetes), but then it's even worse, as the agents remove the server and never see it again (even if it comes back up with the same IP).

yvanoers commented 1 year ago

I'm not that well-versed in the internals of serf, but you could very well be right that this is a serf-related issue. Maybe @vcastellm has more readily available knowledge; I would have to dig into it, which I am willing to do, except my available time has been rather sparse lately.

vcastellm commented 1 year ago

This is an old, known issue. It's caused by how Raft handles nodes, it affects any dynamic-IP system like k8s, and it should be fixable. I need to dig into it; it's something really annoying, so expect me to try to allocate time for this soon.

fopina commented 1 year ago

That's awesome! I did an attempt to trace it but failed...

Even when trying multiple servers, as in the workaround I mentioned, it still doesn't work. Using the low serf reconnect timeout kicks the server out and never lets it back in...

vcastellm commented 1 year ago

I took a deeper look into this: it's not related to Raft but to what you mentioned. Serf is not re-resolving the hostname but keeps using the existing IP. It's always DNS :)

I need to investigate a bit more to come up with a workaround that doesn't involve restarting the agents.

fopina commented 1 year ago

Gentle reminder this is still happening 🩸 😄

jaccky commented 1 year ago

Hi, we have the same issue (the DNS name is not used by Raft; it uses IP addresses, which in k8s change unexpectedly). After pods are restarted, the new IP addresses are not taken into account by Raft.

Did anyone find a solution to this?

fopina commented 1 year ago

I didn't, and it's really annoying. I ended up setting up log alerts (as I have logs in Loki) and killing all agents when the issue starts popping up...

Really bad workaround, but in my case I'd rather break some ongoing jobs than not run any until I manually restart...

jaccky commented 1 year ago

Thanks @fopina for your reply! I have a question for you: with a simple kill of the agent, you are able to stabilize the dkron cluster while retaining all the data (schedules)? I don't understand how this can happen...

fopina commented 1 year ago

Agents have no data; it's all in the server(s). Killing the agents makes them restart and re-resolve the server hostnames. The impact is that any job running there fails (it gets killed as well).

vcastellm commented 9 months ago

@fopina can you check against v4-beta?

fopina commented 9 months ago

I already did @vcastellm : https://github.com/distribworks/dkron/issues/1442#issuecomment-1937762647

It didn't work though :/

jaccky commented 8 months ago

Hi, we tried dkron/dkron:4.0.0-beta4 on an AKS cluster with 3 server nodes. Various restarts of the nodes always resulted in a working cluster with an elected leader. So the issue seems to be finally solved!

fopina commented 8 months ago

@jaccky @vcastellm maybe that is what the other issue/PR refers to (missing leader elections), though the issue I opened is not about the leader.

I have a single-server setup, and if the container restarts, the worker nodes will not reconnect (as the server changed IP but not hostname).

The server itself comes back up and resumes as leader (as a single node). That part I did test, and it wasn't fixed in beta4.

It sounds similar, but maybe it's in a slightly different place in the code? (One case is about server nodes reconnecting to the one that changed IP, and mine is about worker nodes reconnecting to the same server that changed IP but kept its name.) Forcing name resolution seems like it should be the solution there as well, but maybe in another code path.

ivan-kripakov-m10 commented 8 months ago

@fopina Hey there! Have you faced any issues when running more than one dkron server?

AFAIK, retry join is a finite process in dkron. Here's what typically happens when deploying dkron in such a configuration:

  1. Your dkron agent successfully joins the cluster and starts listening to serf events.
  2. If the server is killed, the agent receives a member leave event, but no rejoin process is initiated.
  3. When you deploy a new dkron server node with the same ID but a different IP, the agent does not retry joining in the serf layer, and the dkron server doesn't attempt to find agents and join them to its own serf cluster.

While a DNS solution might work, there could be other approaches to consider. For example, if the agent receives a server leave event and there are no known dkron server nodes, it could initiate a retry-joining process on the dkron agent.
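
Just to illustrate the idea (none of this is actual dkron code; the Agent type and the method names below are made up), the agent-side handling could look roughly like this:

// Illustration only: Agent, knownServers and retryJoinLAN are
// hypothetical stand-ins for whatever dkron actually uses.
package sketch

import "github.com/hashicorp/serf/serf"

type Agent struct{}

// knownServers would return the dkron servers currently known to this agent.
func (a *Agent) knownServers() []string { return nil }

// retryJoinLAN would re-run the same join loop used at startup, which
// re-resolves the configured --retry-join hostnames on each attempt.
func (a *Agent) retryJoinLAN() {}

// watchServerMembership re-triggers retry-join when the last known
// server leaves or fails, instead of waiting for a reconnect to an
// address that no longer exists.
func (a *Agent) watchServerMembership(eventCh <-chan serf.Event) {
	for e := range eventCh {
		et := e.EventType()
		if et != serf.EventMemberLeave && et != serf.EventMemberFailed {
			continue
		}
		if len(a.knownServers()) == 0 {
			go a.retryJoinLAN()
		}
	}
}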

I'm not very familiar with the dkron backend, so I'd like to ask @vcastellm to validate this information.

fopina commented 8 months ago

Hi @ivan-kripakov-m10.

I believe that is not correct: the nodes do keep trying to rejoin at the serf layer, but they only keep the resolved IP; they do not re-resolve.

Regarding multiple server nodes: yes, I used to run a 3-server-node cluster, but the leader election / raft issues were so frequent that the HA setup had more downtime than a single server node, hehe. Also, as my single server node is a service in a swarm cluster, if the host goes down it's reassigned to another node, with very little downtime. I just need to resolve the rejoin of the workers, hehe.

ivan-kripakov-m10 commented 8 months ago

@fopina thanks for the reply!

Just to clarify, retry join is not a feature of the serf layer itself. Instead, it's an abstraction within dkron. You can find the implementation details in the dkron source code at this link: retry_join.

This method is invoked only when a dkron server or agent starts up.

ivan-kripakov-m10 commented 8 months ago

So, I reproduced the issue in a k8s environment. I initiated one dkron server and one dkron agent, then removed the retry-join property from the dkron server configuration. Here's how that property looked:

- "--retry-join=\"provider=k8s label_selector=\"\"app.kubernetes.io/instance={{ .Release.Name }}\"\" namespace=\"\"{{ .Release.Namespace }}\"\"\""

After removing the retry join property and restarting the dkron server, the dkron agent produced the following logs (like yours):

time="2024-02-18T16:29:10Z" level=info msg="agent: Received event" event=member-leave node=dkron-agent-5ffc84b448-4ft7b
time="2024-02-18T16:29:10Z" level=info msg="removing server dkron-server-0 (Addr: 10.0.0.20:6868) (DC: dc1)" node=dkron-agent-5ffc84b448-4ft7b

The issue is not reproducible when the retry-join property is present in the dkron server configuration. With this property, the dkron server is able to discover the dkron agent. Consequently, the dkron agent simply receives an update event rather than only a member-leave event. Below are the logs from the dkron agent:

time="2024-02-18T16:25:11Z" level=info msg="removing server dkron-server-0 (Addr: 10.0.4.97:6868) (DC: dc1)" node=dkron-agent-5ffc84b448-4ft7b
time="2024-02-18T16:25:24Z" level=info msg="agent: Received event" event=member-update node=dkron-agent-5ffc84b448-4ft7b
time="2024-02-18T16:25:24Z" level=info msg="Updating LAN server" node=dkron-agent-5ffc84b448-4ft7b server="dkron-server-0 (Addr: 10.0.3.155:6868) (DC: dc1)"

It appears that you can try adding the dkron-agent DNS name to the retry-join configuration in the dkron-server as a workaround.

fopina commented 8 months ago

@ivan-kripakov-m10 could you highlight the differences between your test and the configuration I posted in the issue itself?

It's using retry-join and a DNS name. Maybe it has indeed been fixed in v4 and I tested it wrong this time.

ivan-kripakov-m10 commented 8 months ago

@fopina no, the issue itself is not fixed in v4 yet :( I'm suggesting a workaround: adding DKRON_RETRY_JOIN with the dkron agents' hosts to the dkron server configuration.

services.server.environment.DKRON_RETRY_JOIN: {{dkron-agents-dns-names}}

fopina commented 8 months ago

@ivan-kripakov-m10 oh got it! Good point, I’ll test in my setup, might be worth it even if it causes some network “noise”!

ivan-kripakov-m10 commented 8 months ago

So, I did a bit of digging into how serf works and if we can use DNS names with it. Here's what I found:

  1. Dkron uses serf.join method.
  2. Serf, in turn, hands off its tasks to the memberlist library (source).
  3. This library resolves IPs and carries on with them (source).

At first glance it seems that we can't solve this problem in the serf layer and have to implement something within dkron.
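
So any fix would probably live in dkron, for example re-resolving the configured addresses on every join attempt instead of reusing a cached IP. A minimal sketch of that idea (illustrative only; this is not dkron's or memberlist's actual code, although serf.Serf.Join itself is a real method):

// Illustration only: re-resolve configured "host:port" entries with a
// fresh DNS lookup right before every serf join attempt.
package sketch

import (
	"fmt"
	"net"

	"github.com/hashicorp/serf/serf"
)

// reResolve expands "host:port" entries into "ip:port" entries using a
// fresh DNS lookup on every call, falling back to the original entry on error.
func reResolve(addrs []string) []string {
	var out []string
	for _, addr := range addrs {
		host, port, err := net.SplitHostPort(addr)
		if err != nil {
			out = append(out, addr)
			continue
		}
		ips, err := net.LookupHost(host)
		if err != nil {
			out = append(out, addr)
			continue
		}
		for _, ip := range ips {
			out = append(out, net.JoinHostPort(ip, port))
		}
	}
	return out
}

// joinWithFreshDNS is a sketch of a retry-join step that never reuses a
// previously cached IP.
func joinWithFreshDNS(s *serf.Serf, configured []string) error {
	if _, err := s.Join(reResolve(configured), true); err != nil {
		return fmt.Errorf("join failed: %w", err)
	}
	return nil
}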

fopina commented 8 months ago

@ivan-kripakov-m10 thank you very much!

As I'm using Docker Swarm, adding DKRON_RETRY_JOIN: tasks.agents to the server service was enough! tasks.agents resolves to ONE OF the healthy replicas, and apparently that's enough, as the replicas are still connected among themselves and cluster membership is updated in all of them!
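
In compose terms that's roughly the following (a sketch based on the repro file above; I'm assuming the agents service is simply named agents, as in the logs):

services:
  server:
    environment:
      # swarm DNS name that resolves to one of the healthy agent replicas
      DKRON_RETRY_JOIN: tasks.agents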

@vcastellm I think this issue still makes sense (agents DO retry to join, but without re-resolving the hostname, so it looks like a bug), but feel free to close it; Ivan's workaround is more than acceptable.