leonliao commented 3 months ago

**Describe the bug**
For Redis Cluster, the current addons/redis/redis-cluster-scripts/redis-cluster-server-start.sh uses the FQDN to add a node to the cluster. After a Redis node pod is rebuilt or created, joining the cluster can fail for one of two reasons: the cached DNS entry is refreshed only after the cluster add-node command has run, or the new FQDN DNS entry becomes resolvable only after the command has run.

**To Reproduce**

Simulate a pod leaving the cluster and rejoining it.

1. Log in to a slave pod in the Redis Cluster and run redis-cli --cluster del-node $current_node_ip_and_port $current_node_cluster_id, which simulates addons/redis/redis-cluster-scripts/redis-cluster-replica-member-leave.sh.
2. Delete the slave pod.
3. There is a chance of seeing errors like the ones below:

Stale DNS cache points the FQDN to the old Pod

current_node_with_port=redis-cluster-shard-kqr-0.redis-cluster-shard-kqr-headless.default.svc:6379
set +x
scale out replica replicated command: redis-cli --cluster add-node redis-cluster-shard-kqr-0.redis-cluster-shard-kqr-headless.default.svc:6379 redis-cluster-shard-kqr-1.redis-cluster-shard-kqr-headless.default.svc:6379 --cluster-slave --cluster-master-id 587b870ce5809309004f785d0837ed11f82a33e5 -a ****
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at redis-cluster-shard-kqr-0.redis-cluster-shard-kqr-headless.default.svc:6379: Connection timed out

DNS entry takes effect only after the add-node command

current_node_with_port=redis-cluster-shard-btn-0.redis-cluster-shard-btn-headless.default.svc:6379
set +x
scale out replica replicated command: redis-cli --cluster add-node redis-cluster-shard-btn-0.redis-cluster-shard-btn-headless.default.svc:6379 redis-cluster-shard-btn-1.redis-cluster-shard-btn-headless.default.svc:6379 --cluster-slave --cluster-master-id 75203464beaf7403fc82eeeb24c98f9a9590a054 -a ****
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at redis-cluster-shard-btn-0.redis-cluster-shard-btn-headless.default.svc:6379: Name or service not known
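
Both errors above come from the FQDN not yet resolving to the new pod. One possible guard is to check, before running add-node, that the FQDN already resolves to the pod's own IP. This is only a minimal sketch: it assumes getent is available in the container and that the pod IP is known to the script (the CURRENT_POD_IP variable in the usage note is hypothetical). The function names are illustrative and are not part of the existing addon scripts.

```shell
# Hypothetical helper: print the first address the resolver returns for a
# host name, or nothing if the name cannot be resolved.
# Assumption: getent is installed in the container image.
resolve_first_address() {
  getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# Hypothetical helper: succeed only if the FQDN resolves and the resolved
# address equals the expected IP. An empty result covers the
# "Name or service not known" case; a different address covers the
# stale-cache case where the name still points to the old pod.
fqdn_points_to_ip() {
  local fqdn="$1"
  local expected_ip="$2"
  local resolved
  resolved="$(resolve_first_address "$fqdn")"
  [ -n "$resolved" ] && [ "$resolved" = "$expected_ip" ]
}
```

A caller could then run something like fqdn_points_to_ip "$current_node_fqdn" "$CURRENT_POD_IP" in a loop before add-node, where both variable names are placeholders. Note that this checks the resolver as seen from the pod running the script. Other pods may still hold a stale cache entry.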


**Expected behavior**
Retry add-node more times, either until it succeeds or until enough time has passed for DNS to take effect.
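
The retry behavior described here could look roughly like the sketch below. It is not the addon's current implementation. The function name and the RETRY_MAX_ATTEMPTS and RETRY_INTERVAL_SECONDS variables are hypothetical, and the defaults are arbitrary values that would need tuning against the actual DNS TTL.

```shell
# Hypothetical helper: run the given command until it succeeds or the
# attempt budget is exhausted. Returns 0 on success, 1 after giving up.
retry_until_success() {
  local max_attempts="${RETRY_MAX_ATTEMPTS:-30}"
  local interval="${RETRY_INTERVAL_SECONDS:-2}"
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      # Log only the command name so a password argument is never echoed.
      echo "giving up on $1 after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} of $1 failed, retrying in ${interval}s" >&2
    sleep "$interval"
    attempt=$((attempt + 1))
  done
}
```

The add-node call from the logs above would then be wrapped as retry_until_success redis-cli --cluster add-node "$current_node_with_port" "$existing_node_with_port" --cluster-slave --cluster-master-id "$master_id" -a "$password", where every variable name except current_node_with_port is a placeholder. With the default values the loop waits for about a minute before giving up.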

**Screenshots**
NA

**Desktop (please complete the following information):**
- OS: macOS
- Version: kubeblocks-0.9.0

weicao commented 3 months ago

I see. In your tests, if you delete the slave Pod of a shard, a new Pod will be created to join the shard. However, due to the time it takes for DNS to become effective, the redis-cli add-node command may result in an error.

leonliao commented 2 months ago

> I see. In your tests, if you delete the slave Pod of a shard, a new Pod will be created to join the shard. However, due to the time it takes for DNS to become effective, the redis-cli add-node command may result in an error.

Yes. I think all addons that use the FQDN to identify nodes should be reviewed to check whether they have the same issue.

weicao commented 2 months ago

> > I see. In your tests, if you delete the slave Pod of a shard, a new Pod will be created to join the shard. However, due to the time it takes for DNS to become effective, the redis-cli add-node command may result in an error.
>
> Yes. I think all addons using FQDN to identify nodes should be reviewed, to check whether addons are having the same issue.

Sure, this is a great suggestion for us.

apecloud / kubeblocks-addons

[BUG] Redis cluster nodes fail to join cluster due to FQDN DNS resolving lag #906
