apecloud / kubeblocks-addons

KubeBlocks add-ons.
Apache License 2.0

[BUG] Redis cluster nodes fail to join cluster due to FQDN DNS resolving lag #906

Open leonliao opened 3 months ago

leonliao commented 3 months ago

**Describe the bug**
For Redis Cluster, the current addons/redis/redis-cluster-scripts/redis-cluster-server-start.sh uses the FQDN to add a node to the cluster. After a Redis node pod is rebuilt or created, the join can fail for either of two reasons: the cached DNS entry for the FQDN is refreshed only after the cluster add-node command has run, or the new FQDN DNS entry becomes resolvable only after the command has run.

**To Reproduce**

Simulate a pod leaving the cluster and rejoining it.

  1. Log in to a slave pod in the Redis Cluster and execute redis-cli --cluster del-node $current_node_ip_and_port $current_node_cluster_id, which simulates addons/redis/redis-cluster-scripts/redis-cluster-replica-member-leave.sh.
  2. Delete the slave pod.
  3. Errors like the following may then appear:

Stale DNS cache points the FQDN to the old Pod

DNS takes effect only after the add-node command
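
To tell the two cases apart, one option is to compare what the FQDN currently resolves to with the IP the new pod actually has. The following is a minimal sketch, not part of the addon scripts: `resolve_ip` and `wait_for_dns` are hypothetical helper names, the timeout and interval defaults are arbitrary, and it assumes `getent` and `awk` are available in the container image.

```shell
#!/bin/bash
# Sketch only: function names, arguments, and defaults are hypothetical.

# Print the first address the resolver returns for a name (empty if none).
resolve_ip() {
  getent hosts "$1" | awk '{ print $1; exit }'
}

# Wait until FQDN resolves to EXPECTED_IP, or give up after TIMEOUT_S seconds.
# An empty result means the entry is not resolvable yet; a different IP
# suggests a stale entry that still points at the old pod.
wait_for_dns() {
  local fqdn="$1" expected_ip="$2" timeout_s="${3:-60}" interval_s="${4:-2}"
  local deadline=$(( SECONDS + timeout_s ))
  local resolved
  while true; do
    resolved="$(resolve_ip "$fqdn" || true)"
    if [ "$resolved" = "$expected_ip" ]; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "DNS for ${fqdn} is '${resolved}', expected '${expected_ip}'" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}
```

A caller would pass the pod's own FQDN and its current IP (for example from an environment variable injected into the pod, which is an assumption here) before attempting add-node.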


**Expected behavior**
Retry add-node more times, until it succeeds or until enough time has passed for DNS to take effect.
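
The retry behavior described above could look roughly like the sketch below. It is an illustration under assumptions, not the script's current code: `retry_until` is a hypothetical helper, and the timeout, interval, and the variable names in the usage comment are placeholders.

```shell
#!/bin/bash
# Sketch only: retry_until is a hypothetical helper, not part of the addon.

# Run a command repeatedly until it succeeds or TIMEOUT_S seconds have passed.
# Usage: retry_until TIMEOUT_S INTERVAL_S command [args...]
retry_until() {
  local timeout_s="$1" interval_s="$2"
  shift 2
  local deadline=$(( SECONDS + timeout_s ))
  local attempt=1
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "giving up after ${attempt} attempt(s): $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${interval_s}s: $*" >&2
    attempt=$(( attempt + 1 ))
    sleep "$interval_s"
  done
}

# Example call (variable names are placeholders):
# retry_until 120 3 redis-cli --cluster add-node \
#   "${new_node_fqdn}:6379" "${existing_node_fqdn}:6379" --cluster-slave
```

Bounding the loop by elapsed time rather than by a fixed attempt count keeps the wait predictable when DNS propagation delay varies between clusters.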

**Screenshots**
NA

**Desktop (please complete the following information):**
- OS: MacOS
- Version: kubeblocks-0.9.0

weicao commented 3 months ago

I see. In your tests, if you delete the slave Pod of a shard, a new Pod will be created to join the shard. However, due to the time it takes for DNS to become effective, the redis-cli add-node command may result in an error.

leonliao commented 2 months ago

> I see. In your tests, if you delete the slave Pod of a shard, a new Pod will be created to join the shard. However, due to the time it takes for DNS to become effective, the redis-cli add-node command may result in an error.

Yes. I think all addons that use an FQDN to identify nodes should be reviewed to check whether they have the same issue.

weicao commented 2 months ago

> > I see. In your tests, if you delete the slave Pod of a shard, a new Pod will be created to join the shard. However, due to the time it takes for DNS to become effective, the redis-cli add-node command may result in an error.
>
> Yes. I think all addons using FQDN to identify nodes should be reviewed, to check whether addons are having the same issue.

Sure, this is a great suggestion for us.