fopina opened this issue 1 year ago
Details pushed to https://github.com/fopina/delme/tree/main/dkron_retry_join_does_not_reresolve_dns to make reproducing easier
I'd assume this won't happen with an HA / multi-server setup, as the server that changes IP will retry-join the other servers itself (and the new IP will then be shared with everyone), but I haven't tested that. I think this is still a valid bug either way, since a single-server setup is documented.
After taking a look at dkron/retry_join.go, I think the agents are not stuck in a retry loop with an outdated IP; they are not retrying at all. So that is not the right place to fix it...
But looking at agent -h, this is an interesting option:
--serf-reconnect-timeout string This is the amount of time to attempt to reconnect to a failed node before
giving up and considering it completely gone. In Kubernetes, you might need
this to about 5s, because there is no reason to try reconnects for default
24h value. Also Raft behaves oddly if node is not reaped and returned with
same ID, but different IP.
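For completeness, trying that on the reproduction setup looks roughly like the snippet below (a sketch only: the flag name and the 5s suggestion come straight from the help text above, and I'm applying it on the agents' side since they are the ones holding on to the failed server):

services:
  agents:
    <<: *base
    command: agent --serf-reconnect-timeout=5s
    environment:
      DKRON_RETRY_JOIN: server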
But now the agents will "see" the server, yet they do not retry the join:
agents_3 | time="2023-01-23T00:22:28Z" level=info msg="removing server dkron1 (Addr: 10.5.0.20:6868) (DC: dc1)" node=e9eca2dc43f1
...
agents_3 | time="2023-01-23T00:22:44Z" level=info msg="agent: Received event" event=member-update node=e9eca2dc43f1
agents_3 | time="2023-01-23T00:22:44Z" level=info msg="agent: Received event" event=member-reap node=e9eca2dc43f1
@yvanoers just in case you're still around, would you have any comment on this one? I've tried debugging, but I believe the part that handles reconnection (and does not re-resolve DNS) is within the serf library, not dkron.
I couldn't find any workaround at all... I've tried setting -serf-reconnect-timeout to a low value (as recommended for Kubernetes), but then it's even worse: the agents remove the server and never see it again (even if it comes back up with the same IP).
I'm not that well-versed in the internals of serf, but you could very well be right that this is a serf-related issue. Maybe @vcastellm has more readily available knowledge; I would have to dig into it, which I am willing to do, except my available time has been rather sparse lately.
This is an old, known issue. It's caused by how Raft handles nodes; it affects any dynamic-IP system like k8s and it should be fixable. I need to dig into it. It's really annoying, so expect that I'll try to allocate time for this soon.
That's awesome! I made an attempt to trace it but failed...
Even when trying multiple servers as the workaround I mentioned, it still doesn't work. Using a low serf reconnect timeout kicks the server out and never allows it back in...
I took a deeper look into this: it's not related to Raft but to what you mentioned. Serf is not resolving the hostname but using the existing IP. It's always DNS :)
I need to investigate a bit more to come up with a workaround that doesn't involve restarting the agents.
Gentle reminder this is still happening 🩸 😄
Hi, we have the same issue (the DNS name is not used by Raft; it uses IP addresses, which in k8s change unexpectedly). After pods are restarted, the new IP addresses are not taken into account by Raft.
Has anyone found a solution to this?
I haven't, and it's really annoying. I ended up setting up log alerts (as I have logs in Loki) and killing all agents when the issue starts popping up...
A really bad workaround, but in my case I prefer breaking some ongoing jobs to not running any at all until I manually restart...
Thanks @fopina for your reply! I have a question for you: with a simple kill of the agents, you are able to stabilize the dkron cluster while retaining all the data (schedules)? I don't understand how this can happen...
Agents have no data; it's all in the server(s). Killing the agents makes them restart and re-resolve the server hostnames. The impact is that any job running on them fails (it gets killed as well).
@fopina can you check against v4-beta?
I already did @vcastellm : https://github.com/distribworks/dkron/issues/1442#issuecomment-1937762647
It didn't work though :/
Hi, we tried dkron/dkron:4.0.0-beta4 on an AKS cluster with 3 server nodes. Various restarts of the nodes always resulted in a working cluster with an elected leader. So the issue seems to be finally solved!
@jaccky @vcastellm maybe that is what the other issue/PR refers to (missing leader elections), though the issue I opened is not about the leader.
I have a single-server setup and, if the container restarts, the worker nodes will not reconnect (as the server changed IP but not hostname);
the server itself comes back up and resumes as leader (as a single node). That part I did test, and it wasn't fixed in beta4.
It sounds similar, but maybe it's in a slightly different place in the code? (One is about server nodes reconnecting to a server that changed IP, and mine is about worker nodes reconnecting to a server that kept its hostname but changed IP.) Forcing name resolution seems like it should be the solution there as well, but maybe in another code path.
@fopina Hey there! Have you faced any issues when running more than one dkron servers?
AFAIK, retry join is a finite process in dkron. Here's what typically happens when deploying dkron in such a configuration:
While a DNS solution might work, there could be other approaches to consider. For example, if the agent receives a server-leave event and there are no known dkron server nodes left, it could re-initiate the retry-join process on the dkron agent side.
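A rough sketch of that idea in Go (purely illustrative, not dkron code; knownServerCount and startRetryJoin are hypothetical stand-ins for whatever the agent actually uses internally):

package main

import "fmt"

type memberEvent struct {
    kind string // e.g. "member-leave", "member-reap", "member-update"
}

// knownServerCount would return how many dkron servers are still present
// in the agent's member list (hypothetical helper).
func knownServerCount() int { return 0 }

// startRetryJoin would re-run the retry-join process, which re-resolves
// the configured hostnames (hypothetical helper).
func startRetryJoin() { fmt.Println("re-running retry-join") }

// handleEvent would run for each membership event the agent receives: once
// the last known server is gone, re-trigger retry-join instead of waiting.
func handleEvent(e memberEvent) {
    if e.kind != "member-leave" && e.kind != "member-reap" {
        return
    }
    if knownServerCount() == 0 {
        startRetryJoin()
    }
}

func main() {
    handleEvent(memberEvent{kind: "member-reap"})
}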
I'm not very familiar with the dkron backend, so I'd like to ask @vcastellm to validate this information.
Hi @ivan-kripakov-m10.
I believe that is not correct: the nodes do keep trying to rejoin at the serf layer, but they only keep the resolved IP; they do not re-resolve.
Regarding multiple server nodes: yes, I used to run a 3-server-node cluster, but the leader election / raft issues were so frequent that the HA setup had more downtime than a single server node, hehe. Also, as my single server node is a service in a swarm cluster, if the host goes down it's reassigned to another node, so there's very little downtime. I just need to solve the rejoin of the workers, hehe.
@fopina thanks for reply!
Just to clarify, retry join is not a feature of the serf layer itself. Instead, it's an abstraction within dkron. You can find the implementation details in the dkron source code at this link: retry_join.
This method is invoked only when a dkron server or agent starts up.
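To make that concrete, here's a simplified sketch (again, not dkron's actual code) of what "resolve once at startup" means in practice; 8946 is just a placeholder port for the example:

package main

import (
    "fmt"
    "log"
    "net"
)

// resolveJoinTargets turns configured retry-join hostnames into concrete
// IP:port addresses. In this sketch it runs exactly once, at startup.
func resolveJoinTargets(hosts []string, port int) []string {
    var addrs []string
    for _, h := range hosts {
        ips, err := net.LookupHost(h) // DNS is consulted here, and only here
        if err != nil {
            log.Printf("could not resolve %s: %v", h, err)
            continue
        }
        for _, ip := range ips {
            addrs = append(addrs, fmt.Sprintf("%s:%d", ip, port))
        }
    }
    return addrs
}

func main() {
    addrs := resolveJoinTargets([]string{"server"}, 8946)
    fmt.Println("joining:", addrs)
    // From here on, reconnect attempts inside the membership layer reuse
    // addrs; resolveJoinTargets is never called again, which matches the
    // behaviour reported in this issue.
}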
So, I reproduced the issue in a k8s environment. I started one dkron server and one dkron agent, then removed the retry-join property from the dkron server configuration. Here's how that property looked:
- "--retry-join=\"provider=k8s label_selector=\"\"app.kubernetes.io/instance={{ .Release.Name }}\"\" namespace=\"\"{{ .Release.Namespace }}\"\"\""
After removing the retry-join property and restarting the dkron server, the dkron agent produced the following logs (like yours):
time="2024-02-18T16:29:10Z" level=info msg="agent: Received event" event=member-leave node=dkron-agent-5ffc84b448-4ft7b
time="2024-02-18T16:29:10Z" level=info msg="removing server dkron-server-0 (Addr: 10.0.0.20:6868) (DC: dc1)" node=dkron-agent-5ffc84b448-4ft7b
The issue is not reproducible when the retry-join property is present in the dkron server configuration. With this property, the dkron server is able to discover the dkron agent, so the agent simply receives an update event rather than only a member-leave event. Below are the logs from the dkron agent:
time="2024-02-18T16:25:11Z" level=info msg="removing server dkron-server-0 (Addr: 10.0.4.97:6868) (DC: dc1)" node=dkron-agent-5ffc84b448-4ft7b
time="2024-02-18T16:25:24Z" level=info msg="agent: Received event" event=member-update node=dkron-agent-5ffc84b448-4ft7b
time="2024-02-18T16:25:24Z" level=info msg="Updating LAN server" node=dkron-agent-5ffc84b448-4ft7b server="dkron-server-0 (Addr: 10.0.3.155:6868) (DC: dc1)"
It appears that you can try adding the dkron-agent DNS name to the retry-join configuration in the dkron-server as a workaround.
@ivan-kripakov-m10 could you highlight the differences between your test and the configuration I posted in the issue itself?
It's using retry-join and a DNS name. Maybe it has indeed been fixed in v4 and I tested it wrong this time.
@fopina no, the issue itself is not fixed in v4 yet :(
I'm suggesting a workaround: add DKRON_RETRY_JOIN with the dkron agents' hosts to the dkron-server configuration:
services.server.environment.DKRON_RETRY_JOIN: {{dkron-agents-dns-names}}
@ivan-kripakov-m10 oh got it! Good point, I'll test it in my setup; it might be worth it even if it causes some network "noise"!
So, I did a bit of digging into how serf works and whether we can use DNS names with it. Here's what I found: at first glance it seems that we can't solve this problem in the serf layer and have to implement something within dkron.
@ivan-kripakov-m10 thank you very much!
As I'm using docker swarm, adding DKRON_RETRY_JOIN: tasks.agents to the server service was enough! tasks.agents resolves to ONE OF the healthy replicas, and apparently that's enough, as the replicas are still connected among themselves and cluster membership is updated in all of them!
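For anyone else on swarm, the whole workaround boils down to something like this in the stack file (a sketch; service names mirror the reproduction compose in the issue description, and tasks.agents is swarm's built-in DNS name for the agents service's replicas):

services:
  server:
    image: dkron/dkron:3.2.1
    command: agent
    environment:
      DKRON_SERVER: 1
      DKRON_BOOTSTRAP_EXPECT: 1
      DKRON_RETRY_JOIN: tasks.agents   # server re-discovers the agents after a restart
  agents:
    image: dkron/dkron:3.2.1
    command: agent
    environment:
      DKRON_RETRY_JOIN: server
    deploy:
      replicas: 3

The agents keep their existing retry-join pointing at the server, so nothing else changes on their side.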
@vcastellm I think this issue still makes sense (agents DO retry to join, but without re-resolving the hostname, so it looks like a bug), but feel free to close it; Ivan's workaround is more than acceptable.
Describe the bug
After both the server and the agents are up and the cluster is running smoothly, if the server goes down and comes back up with a different IP (but the same hostname), the agents do not reconnect.
To Reproduce
docker-compose.yml
x-common: &base
  image: dkron/dkron:3.2.1
  command: agent

networks:
  vpcbr:
    ipam:
      config:

services:
  server:
    <<: *base
    environment:
      DKRON_DATA_DIR: /ext/data
      DKRON_SERVER: 1
      DKRON_NODE_NAME: dkron1
      DKRON_BOOTSTRAP_EXPECT: 1
    ports:
      - 8888:8080
    networks:
      vpcbr:
        ipv4_address: 10.5.0.20

  agents:
    <<: *base
    environment:
      DKRON_RETRY_JOIN: server
    networks:
      vpcbr:
    deploy:
      replicas: 3
Expected behavior
Agents would eventually retry joining using the hostname, picking up the new IP.
Additional context
I understand serf or raft might be tricky with DNS, but in this case the server does start up with proper access to its data/log, with no corruption. And if I restart the agents, they reconnect just fine. It seems the retry simply keeps using the IP from the first join, instead of re-resolving the hostname.
To reproduce the issue, I'm forcing the IP change here, but when running in docker swarm (and I assume in k8s as well) a new IP upon service re-creation is expected without using fixed IPs.
Is this something easy to fix?