apecloud / kubeblocks

KubeBlocks is an open-source control plane software that runs and manages databases, message queues and other stateful applications on K8s.
https://kubeblocks.io
GNU Affero General Public License v3.0

[BUG] kb-post-provision-job fails with "ERR Duplicate master name." #8114

Open MarkKharitonov opened 1 month ago

MarkKharitonov commented 1 month ago

Describe the bug

mark@L-R910LPKW:~$ k get pod
NAME                                                         READY   STATUS    RESTARTS   AGE
aida-dev-xyz-mining-redis-0                             3/3     Running   0          27h
aida-dev-xyz-mining-redis-1                             3/3     Running   0          27h
aida-dev-xyz-mining-redis-sentinel-0                    1/1     Running   0          27h
aida-dev-xyz-mining-redis-sentinel-1                    1/1     Running   0          27h
aida-dev-xyz-mining-redis-sentinel-2                    1/1     Running   0          27h
kb-post-provision-job-aida-dev-xyz-mining-redis-6l9gq   0/1     Error     0          3m
kb-post-provision-job-aida-dev-xyz-mining-redis-7qjhh   0/1     Error     0          3m25s
kb-post-provision-job-aida-dev-xyz-mining-redis-dcxt8   0/1     Error     0          3m41s
mark@L-R910LPKW:~$

To Reproduce

Not sure, but it reproduces very easily for me: I just delete the job so that it is created again, and it errors out.

Expected behavior

No errors.

Additional context

I have 5 Redis instances deployed with KB, each with 2 replicas for the database and 3 for the sentinels. Only one instance exhibits the problematic behavior:

mark@L-R910LPKW:~$ k get job
NAME                                                   STATUS   COMPLETIONS   DURATION   AGE
kb-post-provision-job-aida-dev-xyz-mining-redis   Failed   0/1           6m53s      6m53s
mark@L-R910LPKW:~$ k delete job --all
job.batch "kb-post-provision-job-aida-dev-xyz-mining-redis" deleted
mark@L-R910LPKW:~$ k get job
NAME                                                   STATUS    COMPLETIONS   DURATION   AGE
kb-post-provision-job-aida-dev-xyz-mining-redis   Running   0/1           2s         2s
mark@L-R910LPKW:~$ sleep 30
mark@L-R910LPKW:~$ k get job
NAME                                                   STATUS    COMPLETIONS   DURATION   AGE
kb-post-provision-job-aida-dev-xyz-mining-redis   Running   0/1           41s        41s
mark@L-R910LPKW:~$ k get pod
NAME                                                         READY   STATUS    RESTARTS   AGE
aida-dev-xyz-mining-redis-0                             3/3     Running   0          27h
aida-dev-xyz-mining-redis-1                             3/3     Running   0          27h
aida-dev-xyz-mining-redis-sentinel-0                    1/1     Running   0          27h
aida-dev-xyz-mining-redis-sentinel-1                    1/1     Running   0          27h
aida-dev-xyz-mining-redis-sentinel-2                    1/1     Running   0          27h
kb-post-provision-job-aida-dev-xyz-mining-redis-7dhpq   0/1     Error     0          6s
kb-post-provision-job-aida-dev-xyz-mining-redis-b89f8   0/1     Error     0          47s
kb-post-provision-job-aida-dev-xyz-mining-redis-ffmpj   0/1     Error     0          31s
mark@L-R910LPKW:~$ k logs kb-post-provision-job-aida-dev-xyz-mining-redis-b89f8
+ declare -g default_initialize_pod_ordinal
+ declare -g redis_advertised_svc_host_value
+ declare -g redis_advertised_svc_port_value
+ declare -g headless_postfix=headless
+ declare -g redis_default_service_port=6379
+ echo 'redis sentinel component replicas found, register to sentinel.'
+ register_to_sentinel_wrapper
+ '[' -z aida-dev-xyz-mining-redis-sentinel-0,aida-dev-xyz-mining-redis-sentinel-1,aida-dev-xyz-mining-redis-sentinel-2 ']'
+ '[' -z aida-dev-xyz-mining-redis-sentinel-headless ']'
+ get_minimum_initialize_pod_ordinal
+ '[' -z aida-dev-xyz-mining-redis-0,aida-dev-xyz-mining-redis-1 ']'
+ IFS=,
+ read -ra pod_list
+ for pod in "${pod_list[@]}"
+ '[' -z '' ']'
redis sentinel component replicas found, register to sentinel.
++ extract_ordinal_from_object_name aida-dev-xyz-mining-redis-0
++ local object_name=aida-dev-xyz-mining-redis-0
++ local ordinal=0
++ echo 0
+ default_initialize_pod_ordinal=0
+ continue
+ for pod in "${pod_list[@]}"
+ '[' -z 0 ']'
++ extract_ordinal_from_object_name aida-dev-xyz-mining-redis-1
++ local object_name=aida-dev-xyz-mining-redis-1
++ local ordinal=1
++ echo 1
+ pod_ordinal=1
+ '[' 1 -lt 0 ']'
+ default_redis_primary_pod_name=aida-dev-xyz-mining-redis-0
+ redis_default_primary_pod_headless_fqdn=aida-dev-xyz-mining-redis-0.aida-dev-xyz-mining-redis-headless.system-d-redis-aida-dev-xyz-mining.svc
+ init_redis_service_port
+ '[' -n 6379 ']'
+ redis_default_service_port=6379
+ parse_redis_advertised_svc_if_exist aida-dev-xyz-mining-redis-0
+ local pod_name=aida-dev-xyz-mining-redis-0
+ [[ -z '' ]]
+ echo 'Environment variable REDIS_ADVERTISED_PORT not found. Ignoring.'
Environment variable REDIS_ADVERTISED_PORT not found. Ignoring.
+ return 0
+ old_ifs='
'
+ IFS=,
+ set -f
+ read -ra sentinel_pod_list
+ set +f
+ IFS='
'
+ for sentinel_pod in "${sentinel_pod_list[@]}"
+ sentinel_pod_fqdn=aida-dev-xyz-mining-redis-sentinel-0.aida-dev-xyz-mining-redis-sentinel-headless
+ '[' -n '' ']'
+ echo 'register to sentinel:aida-dev-xyz-mining-redis-sentinel-0.aida-dev-xyz-mining-redis-sentinel-headless with ClusterIP service: redis_default_primary_pod_fqdn=aida-dev-xyz-mining-redis-0.aida-dev-xyz-mining-redis-headless.system-d-redis-aida-dev-xyz-mining.svc, redis_default_service_port=6379'
register to sentinel:aida-dev-xyz-mining-redis-sentinel-0.aida-dev-xyz-mining-redis-sentinel-headless with ClusterIP service: redis_default_primary_pod_fqdn=aida-dev-xyz-mining-redis-0.aida-dev-xyz-mining-redis-headless.system-d-redis-aida-dev-xyz-mining.svc, redis_default_service_port=6379
+ register_to_sentinel aida-dev-xyz-mining-redis-sentinel-0.aida-dev-xyz-mining-redis-sentinel-headless aida-dev-xyz-mining-redis aida-dev-xyz-mining-redis-0.aida-dev-xyz-mining-redis-headless.system-d-redis-aida-dev-xyz-mining.svc 6379
+ local sentinel_host=aida-dev-xyz-mining-redis-sentinel-0.aida-dev-xyz-mining-redis-sentinel-headless
+ local master_name=aida-dev-xyz-mining-redis
+ local sentinel_port=26379
+ local redis_primary_host=aida-dev-xyz-mining-redis-0.aida-dev-xyz-mining-redis-headless.system-d-redis-aida-dev-xyz-mining.svc
+ local redis_primary_port=6379
+ local timeout=600
++ date +%s
+ local start_time=1725899217
+ local current_time
+ set +x
Checking connectivity to aida-dev-xyz-mining-redis-sentinel-0.aida-dev-xyz-mining-redis-sentinel-headless on port 26379 using redis-cli...
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
aida-dev-xyz-mining-redis-sentinel-0.aida-dev-xyz-mining-redis-sentinel-headless is reachable on port 26379.
Checking connectivity to aida-dev-xyz-mining-redis-0.aida-dev-xyz-mining-redis-headless.system-d-redis-aida-dev-xyz-mining.svc on port 6379 using redis-cli...
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
aida-dev-xyz-mining-redis-0.aida-dev-xyz-mining-redis-headless.system-d-redis-aida-dev-xyz-mining.svc is reachable on port 6379.
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
ERR Duplicate master name.
Command failed with status 0 or output not OK.
mark@L-R910LPKW:~$ k get job
NAME                                                   STATUS   COMPLETIONS   DURATION   AGE
kb-post-provision-job-aida-dev-xyz-mining-redis   Failed   0/1           71s        71s
mark@L-R910LPKW:~$

Y-Rookie commented 1 month ago

@MarkKharitonov Thank you for raising this issue. In the current KubeBlocks Redis, the kb-post-provision-job-xxx is primarily used to register Redis with all Redis Sentinel instances, which enables high availability for the Redis cluster.

The current implementation of this job is not idempotent. When Redis registers successfully with some Redis Sentinel instances but fails to register with others (for example, because of network connectivity issues or unhealthy instances), the post-provision-job fails and retries. Deleting the job, as you mentioned, triggers the same retry. On the retry, the Sentinel instances that were already registered return the error "ERR Duplicate master name.", so the job fails again. This is the cause of the issue you encountered.

We will address this problem in the future by changing the Redis registration logic to make it idempotent. Thank you again for bringing this to our attention.
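
To illustrate what an idempotent registration could look like, here is a minimal bash sketch, not the actual KubeBlocks script. It treats the "Duplicate master name" reply from `SENTINEL MONITOR` as "already registered" instead of as a failure. The names `run_sentinel_cmd`, `sentinel_monitor_reply_ok` and `register_to_sentinel_idempotent`, as well as the default quorum of 2, are assumptions made for this example; authentication is left out.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper around redis-cli. Authentication and TLS options are
# omitted here and would have to match the real deployment.
run_sentinel_cmd() {
  local sentinel_host="$1"
  shift
  redis-cli -h "$sentinel_host" -p 26379 "$@"
}

# Classify the reply of SENTINEL MONITOR.
# Returns 0 if the master is registered, either by this call ("OK") or by
# an earlier one ("ERR Duplicate master name."). Returns 1 for anything else.
sentinel_monitor_reply_ok() {
  case "$1" in
    OK) return 0 ;;
    *"Duplicate master name"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Register a primary with one Sentinel; safe to call again after a partial run.
register_to_sentinel_idempotent() {
  local sentinel_host="$1" master_name="$2"
  local primary_host="$3" primary_port="$4"
  local quorum="${5:-2}"   # assumed default, not taken from the real script
  local reply

  # The exit status of the command is ignored on purpose: the reply text
  # decides whether the registration counts as done.
  reply="$(run_sentinel_cmd "$sentinel_host" \
    SENTINEL MONITOR "$master_name" "$primary_host" "$primary_port" "$quorum")" || true

  if sentinel_monitor_reply_ok "$reply"; then
    echo "sentinel $sentinel_host: $master_name is registered ($reply)"
    return 0
  fi
  echo "sentinel $sentinel_host: registration failed: $reply" >&2
  return 1
}
```

Calling `register_to_sentinel_idempotent` once per Sentinel host lets a retried job skip the instances that already know the master and only fail on genuine errors. One caveat of this sketch: a duplicate name does not prove that the Sentinel monitors the expected address, so a stricter variant might compare the output of `SENTINEL GET-MASTER-ADDR-BY-NAME` with the intended primary before accepting the duplicate.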

github-actions[bot] commented 2 days ago

This issue has been marked as stale because it has been open for 30 days with no activity