Closed by ahjing99 5 months ago
This issue has been marked as stale because it has been open for 30 days with no activity
This issue cannot be fixed under the sentinel + redis architecture. When a network fault is injected into the master pod, neither sentinel nor the redis slave can reach the master, which amounts to a network partition. Sentinel detects the failure and promotes the slave to be the new master, but some writes still succeed on the old master during the partition. This is a common outcome of a network partition. The dual primary/master state, however, needs to be fixed ASAP.
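The sequence described above can be replayed with a toy model. This is only a sketch: the `Node` class and `simulate_partition` function are hypothetical names invented for illustration, the node names are placeholders, and nothing here talks to a real Redis or Sentinel. It assumes the old master resynchronizes from the new master once the partition heals.

```python
class Node:
    """Toy key-value node used for illustration; not real Redis."""

    def __init__(self, name, role):
        self.name = name
        self.role = role  # "master" or "slave"
        self.data = {}

    def write(self, key, value):
        # Only a node that believes it is master accepts writes.
        if self.role != "master":
            raise RuntimeError("You can't write against a read only replica")
        self.data[key] = value


def simulate_partition():
    old_master = Node("old-master", "master")
    new_master = Node("new-master", "slave")

    # Normal operation: the write lands on the master and is replicated.
    old_master.write("mykey", "1")
    new_master.data = dict(old_master.data)

    # Partition: sentinel cannot reach the old master and promotes the slave.
    new_master.role = "master"

    # A client on the old master's side of the partition is still acknowledged.
    old_master.write("mykey", "2")
    # Clients that followed the failover write to the new master.
    new_master.write("mykey", "3")

    # Both nodes accept writes at this point.
    dual_master = old_master.role == "master" and new_master.role == "master"
    value_on_old_master = old_master.data["mykey"]

    # Partition heals: the old master is demoted and resyncs from the new
    # master, which discards the writes it accepted while isolated.
    old_master.role = "slave"
    old_master.data = dict(new_master.data)

    return dual_master, value_on_old_master, old_master.data["mykey"]


print(simulate_partition())  # (True, '2', '3')
```

The value `"2"` was acknowledged to a client but no longer exists on either node after the resync, which is the loss this comment refers to.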
The dual primary/master state can be fixed in a way similar to Patroni for PostgreSQL. Sentinel always holds fresh and correct information about the cluster, so when a failover completes, sentinel can emit a role change event to the lorry sidecar. The message is passed on to the KB controller, and finally the partitioned primary pod's label is rectified, so the services that refer to the 'primary' label also return to a consistent state. During the dual primary phase, some client writes routed to the partitioned primary pod will fail with the reply 'You can't write against a read only replica'.
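A minimal sketch of the event-handling step, under stated assumptions: it parses the payload of Sentinel's `+switch-master` pub/sub event (documented as `<master name> <oldip> <oldport> <newip> <newport>`) and derives the role label each pod should carry. The label key `kubeblocks.io/role`, the helper names, and the pod names and IPs in the usage lines are assumptions for illustration, not the lorry or KB controller implementation. If Sentinel is configured to announce hostnames instead of IPs, the comparison would need to match on hostnames.

```python
from dataclasses import dataclass

ROLE_LABEL = "kubeblocks.io/role"  # assumed label key, for illustration


@dataclass(frozen=True)
class SwitchMaster:
    master_name: str
    old_addr: tuple  # (ip, port) of the demoted master
    new_addr: tuple  # (ip, port) of the promoted master


def parse_switch_master(payload: str) -> SwitchMaster:
    """Parse '<master name> <oldip> <oldport> <newip> <newport>'."""
    parts = payload.split()
    if len(parts) != 5:
        raise ValueError(f"unexpected +switch-master payload: {payload!r}")
    name, old_ip, old_port, new_ip, new_port = parts
    return SwitchMaster(name, (old_ip, int(old_port)), (new_ip, int(new_port)))


def desired_role_labels(event: SwitchMaster, pod_ips: dict) -> dict:
    """Map each pod name to the role label it should have after failover.

    pod_ips maps pod name -> pod IP; only the promoted address is primary.
    """
    return {
        pod: {ROLE_LABEL: "primary" if ip == event.new_addr[0] else "secondary"}
        for pod, ip in pod_ips.items()
    }


# Hypothetical payload and pod IPs.
event = parse_switch_master("mymaster 10.0.0.1 6379 10.0.0.2 6379")
print(desired_role_labels(event, {"redis-0": "10.0.0.1", "redis-1": "10.0.0.2"}))
```

A controller could then patch any pod whose current label differs from the computed one, which would clear the stale 'primary' label on the partitioned pod.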
kbcli version
Kubernetes: v1.27.3-gke.100
KubeBlocks: 0.7.0-alpha.8
kbcli: 0.7.0-alpha.8
Steps:
NetworkChaos network-chaos-65g9m created
➜ ~ kbcli cluster connect cluster-oqroov
Connect to instance cluster-oqroov-redis-0: out of cluster-oqroov-redis-0, cluster-oqroov-redis-1
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
127.0.0.1:6379> get mykey
"4"
127.0.0.1:6379> get mykey
"5"
127.0.0.1:6379> get mykey
"5"
127.0.0.1:6379> get mykey
"7"
kbcli cluster describe cluster-oqroov
Name: cluster-oqroov    Created Time: Sep 13,2023 09:49 UTC+0800
NAMESPACE   CLUSTER-DEFINITION   VERSION       STATUS            TERMINATION-POLICY
default     redis                redis-7.0.6   ConditionsError   WipeOut

Endpoints:
COMPONENT        MODE        INTERNAL                                                        EXTERNAL
redis            ReadWrite   cluster-oqroov-redis.default.svc.cluster.local:6379
redis-sentinel   ReadWrite   cluster-oqroov-redis-sentinel.default.svc.cluster.local:26379

Topology:
COMPONENT        INSTANCE                          ROLE        STATUS    AZ              NODE                                                  CREATED-TIME
redis            cluster-oqroov-redis-0            primary     Running   us-central1-c   gke-yjtest-default-pool-47e27321-mvr4/10.128.15.201   Sep 13,2023 10:16 UTC+0800
redis            cluster-oqroov-redis-1            primary     Running   us-central1-c   gke-yjtest-default-pool-47e27321-rbkc/10.128.15.202   Sep 13,2023 09:49 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-0               Running   us-central1-c   gke-yjtest-default-pool-47e27321-mvr4/10.128.15.201   Sep 13,2023 09:49 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-1               Running   us-central1-c   gke-yjtest-default-pool-47e27321-h6tl/10.128.15.203   Sep 13,2023 09:50 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-2               Running   us-central1-c   gke-yjtest-default-pool-47e27321-rbkc/10.128.15.202   Sep 13,2023 09:50 UTC+0800

Resources Allocation:
COMPONENT        DEDICATED   CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE-SIZE   STORAGE-CLASS
redis            false       500m / 500m          1Gi / 1Gi               data:5Gi       kb-default-sc
redis-sentinel   false       500m / 500m          1Gi / 1Gi               data:5Gi       kb-default-sc

Images:
COMPONENT        TYPE             IMAGE
redis            redis            registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
redis-sentinel   redis-sentinel   registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8

Data Protection:
AUTO-BACKUP   BACKUP-SCHEDULE   TYPE   BACKUP-TTL   LAST-SCHEDULE   RECOVERABLE-TIME
Disabled                               7d
Show cluster events: kbcli cluster list-events -n default cluster-oqroov
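The dual primary state visible in the Topology section above (both redis instances report the role primary) can be checked mechanically. The following is a hedged sketch: `find_dual_primaries` is a hypothetical helper, and the rows are transcribed by hand from the output above rather than obtained by parsing kbcli output.

```python
from collections import defaultdict


def find_dual_primaries(topology):
    """Return {component: [instances]} for components with >1 primary.

    topology is an iterable of (component, instance, role) tuples.
    """
    primaries = defaultdict(list)
    for component, instance, role in topology:
        if role == "primary":
            primaries[component].append(instance)
    return {c: names for c, names in primaries.items() if len(names) > 1}


# Rows copied from the Topology section of the describe output above.
topology = [
    ("redis", "cluster-oqroov-redis-0", "primary"),
    ("redis", "cluster-oqroov-redis-1", "primary"),
    ("redis-sentinel", "cluster-oqroov-redis-sentinel-0", ""),
    ("redis-sentinel", "cluster-oqroov-redis-sentinel-1", ""),
    ("redis-sentinel", "cluster-oqroov-redis-sentinel-2", ""),
]

print(find_dual_primaries(topology))
# {'redis': ['cluster-oqroov-redis-0', 'cluster-oqroov-redis-1']}
```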
kbcli cluster describe cluster-oqroov
Name: cluster-oqroov    Created Time: Sep 13,2023 09:49 UTC+0800
NAMESPACE   CLUSTER-DEFINITION   VERSION       STATUS    TERMINATION-POLICY
default     redis                redis-7.0.6   Running   WipeOut

Endpoints:
COMPONENT        MODE        INTERNAL                                                        EXTERNAL
redis            ReadWrite   cluster-oqroov-redis.default.svc.cluster.local:6379
redis-sentinel   ReadWrite   cluster-oqroov-redis-sentinel.default.svc.cluster.local:26379

Topology:
COMPONENT        INSTANCE                          ROLE        STATUS    AZ              NODE                                                  CREATED-TIME
redis            cluster-oqroov-redis-0            secondary   Running   us-central1-c   gke-yjtest-default-pool-47e27321-mvr4/10.128.15.201   Sep 13,2023 10:16 UTC+0800
redis            cluster-oqroov-redis-1            primary     Running   us-central1-c   gke-yjtest-default-pool-47e27321-rbkc/10.128.15.202   Sep 13,2023 09:49 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-0               Running   us-central1-c   gke-yjtest-default-pool-47e27321-mvr4/10.128.15.201   Sep 13,2023 09:49 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-1               Running   us-central1-c   gke-yjtest-default-pool-47e27321-h6tl/10.128.15.203   Sep 13,2023 09:50 UTC+0800
redis-sentinel   cluster-oqroov-redis-sentinel-2               Running   us-central1-c   gke-yjtest-default-pool-47e27321-rbkc/10.128.15.202   Sep 13,2023 09:50 UTC+0800

Resources Allocation:
COMPONENT        DEDICATED   CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE-SIZE   STORAGE-CLASS
redis            false       500m / 500m          1Gi / 1Gi               data:5Gi       kb-default-sc
redis-sentinel   false       500m / 500m          1Gi / 1Gi               data:5Gi       kb-default-sc

Images:
COMPONENT        TYPE             IMAGE
redis            redis            registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
redis-sentinel   redis-sentinel   registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8

Data Protection:
AUTO-BACKUP   BACKUP-SCHEDULE   TYPE   BACKUP-TTL   LAST-SCHEDULE   RECOVERABLE-TIME
Disabled                               7d
Show cluster events: kbcli cluster list-events -n default cluster-oqroov
➜ ~ kbcli cluster connect cluster-oqroov
Connect to instance cluster-oqroov-redis-0: out of cluster-oqroov-redis-0, cluster-oqroov-redis-1
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
127.0.0.1:6379> get mykey
"4"
127.0.0.1:6379> get mykey
"5"
127.0.0.1:6379> get mykey
"5"
127.0.0.1:6379> get mykey
"7"
127.0.0.1:6379> get mykey
Error: Server closed the connection
not connected> get mykey
"1"
127.0.0.1:6379>