
[bitnami/redis] Abnormal behavior of 'Redis + Sentinel' after server reboots #8878

Status: Closed (work4DevLMLOps closed this issue 2 years ago)

work4DevLMLOps commented 2 years ago

Which chart: https://github.com/bitnami/charts/tree/master/bitnami/redis:

[root@server1 redis]# helm ls -n redis
NAME    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
redis   redis           1               2022-02-02 12:20:46.113567561 +0000 UTC deployed        redis-16.1.0    6.2.6

Describe the bug

To Reproduce

I used the above-mentioned chart and configured Redis + Sentinel successfully. However, whenever I reboot my server, it fails with various errors.

POD:

[root@server1  redis]# kubectl get pods,svc,ep,sts -n redis
NAME                                      READY   STATUS    RESTARTS      AGE
pod/redis-node-2                          3/3     Running   9 (44m ago)   4h44m
pod/redis-node-0                          3/3     Running   9 (44m ago)   4h45m
pod/redis-node-1                          3/3     Running   9 (44m ago)   4h44m

NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                        AGE
service/redis-headless    ClusterIP   None            <none>        6379/TCP,26379/TCP                             4h45m
service/redis-metrics     ClusterIP   10.43.40.195    <none>        9121/TCP                                       4h45m
service/redis             ClusterIP   10.43.162.52    <none>        6379/TCP,26379/TCP                             4h45m

NAME                        ENDPOINTS                                                        AGE
endpoints/redis-headless    10.42.0.235:6379,10.42.0.236:6379,10.42.0.237:6379 + 3 more...   4h45m
endpoints/redis             10.42.0.235:6379,10.42.0.236:6379,10.42.0.237:6379 + 3 more...   4h45m
endpoints/redis-metrics     10.42.0.235:9121,10.42.0.236:9121,10.42.0.237:9121               4h45m

NAME                          READY   AGE
statefulset.apps/redis-node   3/3     4h45m
[root@server1 redis]#

Scenario 1:

In the case below, the slaves are not connected after the server reboot.

[root@server1 redis]# kubectl -n redis exec -it redis-node-0 redis -- bash
Defaulted container "redis" out of: redis, sentinel, metrics
I have no name!@redis-node-0:/$ redis-cli
127.0.0.1:6379> auth passwd
OK
127.0.0.1:6379> info replication
# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:96f05c39fe1ea72d58343d6d9f0e4c982cfaf5fe
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
127.0.0.1:6379>

+++++++++++++++++++++++++++++

[root@server1 redis]# kubectl -n redis exec -it redis-node-0 sentinel -- bash
Defaulted container "redis" out of: redis, sentinel, metrics
I have no name!@redis-node-0:/$ redis-cli -p 26379
127.0.0.1:26379> auth passwd
OK
127.0.0.1:26379> sentinel master mymaster
 1) "name"
 2) "mymaster"
 3) "ip"
 4) "redis-node-0.redis-headless.redis.svc.cluster.local"
 5) "port"
 6) "6379"
 7) "runid"
 8) "f8fd9a63cfe44c63b0859737bb2e9334313fff9c"
 9) "flags"
10) "master"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "30"
19) "last-ping-reply"
20) "30"
21) "down-after-milliseconds"
22) "60000"
23) "info-refresh"
24) "8028"
25) "role-reported"
26) "master"
27) "role-reported-time"
28) "2207370"
29) "config-epoch"
30) "0"
31) "num-slaves"
32) "0"
33) "num-other-sentinels"
34) "0"
35) "quorum"
36) "2"
37) "failover-timeout"
38) "18000"
39) "parallel-syncs"
40) "1"

127.0.0.1:26379> sentinel slaves  mymaster
(empty array)
127.0.0.1:26379>

Scenario 2:

In the case below, the slaves are not able to connect after the server reboot: redis.redis.svc.cluster.local:26379: Connection refused.

[root@server1 redis]# kubectl -n redis logs -f redis-node-2 -c redis
 12:31:49.94 INFO  ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 5 redis-cli -h redis.redis.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.redis.svc.cluster.local:26379: Connection refused
1:C 02 Feb 2022 12:31:51.955 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 02 Feb 2022 12:31:51.955 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 02 Feb 2022 12:31:51.955 # Configuration loaded
1:M 02 Feb 2022 12:31:51.956 * monotonic clock: POSIX clock_gettime
1:M 02 Feb 2022 12:31:51.956 * Running mode=standalone, port=6379.
1:M 02 Feb 2022 12:31:51.956 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 02 Feb 2022 12:31:51.956 # Server initialized
1:M 02 Feb 2022 12:31:51.957 * Reading RDB preamble from AOF file...
1:M 02 Feb 2022 12:31:51.957 * Loading RDB produced by version 6.2.6
1:M 02 Feb 2022 12:31:51.957 * RDB age 583 seconds
1:M 02 Feb 2022 12:31:51.957 * RDB memory usage when created 1.79 Mb
1:M 02 Feb 2022 12:31:51.957 * RDB has an AOF tail
1:M 02 Feb 2022 12:31:51.957 # Done loading RDB, keys loaded: 0, keys expired: 0.
1:M 02 Feb 2022 12:31:51.957 * Reading the remaining AOF tail...
1:M 02 Feb 2022 12:31:51.958 * DB loaded from append only file: 0.001 seconds
1:M 02 Feb 2022 12:31:51.958 * Ready to accept connections
^C
[root@server1 redis]# kubectl -n redis logs -f redis-node-2 -c sentinel
 12:31:51.58 INFO  ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 5 redis-cli -h redis.redis.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.redis.svc.cluster.local:26379: Connection refused
1:X 02 Feb 2022 12:31:53.527 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 02 Feb 2022 12:31:53.527 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 02 Feb 2022 12:31:53.527 # Configuration loaded
1:X 02 Feb 2022 12:31:53.528 * monotonic clock: POSIX clock_gettime
1:X 02 Feb 2022 12:31:53.528 * Running mode=sentinel, port=26379.
1:X 02 Feb 2022 12:31:53.528 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 02 Feb 2022 12:31:53.529 # Sentinel ID is 9fe32540b27937ed9f341b0f610a0d8df405bb63
1:X 02 Feb 2022 12:31:53.529 # +monitor master mymaster redis-node-2.redis-headless.redis.svc.cluster.local 6379 quorum 2

[root@server1 redis]# kubectl -n redis logs -f redis-node-0 -c sentinel
 12:31:51.59 INFO  ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 5 redis-cli -h redis.redis.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.redis.svc.cluster.local:26379: Connection refused
1:X 02 Feb 2022 12:31:53.527 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 02 Feb 2022 12:31:53.527 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 02 Feb 2022 12:31:53.527 # Configuration loaded
1:X 02 Feb 2022 12:31:53.528 * monotonic clock: POSIX clock_gettime
1:X 02 Feb 2022 12:31:53.528 * Running mode=sentinel, port=26379.
1:X 02 Feb 2022 12:31:53.528 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 02 Feb 2022 12:31:53.529 # Sentinel ID is 2a09ba7abbb41ee71e79087310d75f9809c3c815
1:X 02 Feb 2022 12:31:53.529 # +monitor master mymaster redis-node-0.redis-headless.redis.svc.cluster.local 6379 quorum 2
^C
[root@server1 redis]# kubectl -n redis logs -f redis-node-0 -c redis
 12:31:49.94 INFO  ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 5 redis-cli -h redis.redis.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.redis.svc.cluster.local:26379: Connection refused
1:C 02 Feb 2022 12:31:51.954 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 02 Feb 2022 12:31:51.954 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 02 Feb 2022 12:31:51.954 # Configuration loaded
1:M 02 Feb 2022 12:31:51.955 * monotonic clock: POSIX clock_gettime
1:M 02 Feb 2022 12:31:51.955 * Running mode=standalone, port=6379.
1:M 02 Feb 2022 12:31:51.955 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 02 Feb 2022 12:31:51.955 # Server initialized
1:M 02 Feb 2022 12:31:51.955 * Ready to accept connections
^C
[root@server1 redis]# kubectl -n redis logs -f redis-node-1 -c redis
 12:31:49.94 INFO  ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 5 redis-cli -h redis.redis.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.redis.svc.cluster.local:26379: Connection refused
1:C 02 Feb 2022 12:31:51.889 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 02 Feb 2022 12:31:51.889 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 02 Feb 2022 12:31:51.889 # Configuration loaded
1:M 02 Feb 2022 12:31:51.890 * monotonic clock: POSIX clock_gettime
1:M 02 Feb 2022 12:31:51.890 * Running mode=standalone, port=6379.
1:M 02 Feb 2022 12:31:51.890 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 02 Feb 2022 12:31:51.890 # Server initialized
1:M 02 Feb 2022 12:31:51.891 * Reading RDB preamble from AOF file...
1:M 02 Feb 2022 12:31:51.891 * Loading RDB produced by version 6.2.6
1:M 02 Feb 2022 12:31:51.891 * RDB age 621 seconds
1:M 02 Feb 2022 12:31:51.891 * RDB memory usage when created 1.79 Mb
1:M 02 Feb 2022 12:31:51.891 * RDB has an AOF tail
1:M 02 Feb 2022 12:31:51.891 # Done loading RDB, keys loaded: 0, keys expired: 0.
1:M 02 Feb 2022 12:31:51.891 * Reading the remaining AOF tail...
1:M 02 Feb 2022 12:31:51.891 * DB loaded from append only file: 0.001 seconds
1:M 02 Feb 2022 12:31:51.891 * Ready to accept connections
^C
[root@server1 redis]# kubectl -n redis logs -f redis-node-1 -c sentinel
 12:31:51.71 INFO  ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 5 redis-cli -h redis.redis.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
Could not connect to Redis at redis.redis.svc.cluster.local:26379: Connection refused
1:X 02 Feb 2022 12:31:53.527 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 02 Feb 2022 12:31:53.527 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 02 Feb 2022 12:31:53.527 # Configuration loaded
1:X 02 Feb 2022 12:31:53.528 * monotonic clock: POSIX clock_gettime
1:X 02 Feb 2022 12:31:53.528 * Running mode=sentinel, port=26379.
1:X 02 Feb 2022 12:31:53.528 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 02 Feb 2022 12:31:53.528 # Sentinel ID is 33535e4e17bf8f9f9ff9ce8f9ddf609e558ff4f2
1:X 02 Feb 2022 12:31:53.528 # +monitor master mymaster redis-node-1.redis-headless.redis.svc.cluster.local 6379 quorum 2

Could anyone help me?

Thanks & Regards, Keshaba Mahapatra.

migruiz4 commented 2 years ago

Hi @ping2kpm, thank you for reporting this issue.

I think that I know what may be happening.

Looks like, under certain circumstances, the Redis service may not be available after a server reboot, probably because all 3 pods started at the same time and none of them was recognized as a master by the others.

This caused each node to think it is the master, running into a split-brain scenario where the quorum condition can never be fulfilled:

+monitor master mymaster redis-node-1.redis-headless.redis.svc.cluster.local 6379 quorum 2
+monitor master mymaster redis-node-0.redis-headless.redis.svc.cluster.local 6379 quorum 2
+monitor master mymaster redis-node-2.redis-headless.redis.svc.cluster.local 6379 quorum 2
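
A quick way to confirm the split-brain from outside the pods (a minimal sketch using the names shown above; REDISCLI_AUTH picks up the password already present in the sentinel container's environment):

    for i in 0 1 2; do
        kubectl -n redis exec redis-node-$i -c sentinel -- bash -c \
            'REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -p 26379 sentinel get-master-addr-by-name mymaster'
    done

All three Sentinels should return the same ip/port pair; getting three different answers, each pointing at the pod's own hostname, confirms that every node is monitoring itself.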

Caused by:
https://github.com/bitnami/charts/blob/dcec32e41d1e608f4e1f784e8d15ac3f6853d296/bitnami/redis/templates/scripts-configmap.yaml#L81-L90
https://github.com/bitnami/charts/blob/dcec32e41d1e608f4e1f784e8d15ac3f6853d296/bitnami/redis/templates/scripts-configmap.yaml#L101-L108

When restarted normally, Redis performs a RollingUpdate (unless you override the updateStrategy value), and nodes 1 and 2 won't start until the previous node is initialized and ready. However, when the Kubernetes node itself is rebooted, all of the pods may be restarted simultaneously, although I'm not 100% sure about that.
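
For reference, the relevant settings can be read straight off the StatefulSet (standard Kubernetes fields, not chart-specific values). Note that updateStrategy only governs rolling updates; after a node reboot the kubelet restarts the existing containers in place, so no ordering is enforced on that path, which would explain all three pods coming back at roughly the same time:

    kubectl -n redis get statefulset redis-node \
        -o jsonpath='{.spec.updateStrategy.type}{" / "}{.spec.podManagementPolicy}{"\n"}'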

I have created an internal task to take a deeper look at this race condition and see what can be done to protect Redis against this scenario.

For the moment, I think deleting the Redis pods should resolve this: Kubernetes will recreate them in order, so they can assume a role other than master and start working.
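
A minimal sketch of that recovery (assuming the StatefulSet keeps the default OrderedReady podManagementPolicy, so pods are recreated one at a time):

    # Delete all pods; the StatefulSet recreates them in order (0, then 1, then 2),
    # so the replicas find an already-running master when they initialize.
    kubectl -n redis delete pod redis-node-0 redis-node-1 redis-node-2
    kubectl -n redis rollout status statefulset/redis-node

    # Verify: the master should now report connected_slaves:2.
    kubectl -n redis exec redis-node-0 -c redis -- bash -c \
        'REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli info replication | grep -E "^role|^connected_slaves"'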

alemorcuq commented 2 years ago

Hi @ping2kpm,

Could you test removing the timeout 5 bit from these two lines (83 and 85)?

https://github.com/bitnami/charts/blob/dcec32e41d1e608f4e1f784e8d15ac3f6853d296/bitnami/redis/templates/scripts-configmap.yaml#L83-85

    get_sentinel_master_info() {
        if is_boolean_yes "$REDIS_SENTINEL_TLS_ENABLED"; then
            sentinel_info_command="{{- if and .Values.auth.enabled .Values.auth.sentinel }}REDISCLI_AUTH="\$REDIS_PASSWORD" {{ end }}timeout 5 redis-cli -h $REDIS_SERVICE -p $SENTINEL_SERVICE_PORT --tls --cert ${REDIS_SENTINEL_TLS_CERT_FILE} --key ${REDIS_SENTINEL_TLS_KEY_FILE} --cacert ${REDIS_SENTINEL_TLS_CA_FILE} sentinel get-master-addr-by-name {{ .Values.sentinel.masterSet }}"
        else
            sentinel_info_command="{{- if and .Values.auth.enabled .Values.auth.sentinel }}REDISCLI_AUTH="\$REDIS_PASSWORD" {{ end }}timeout 5 redis-cli -h $REDIS_SERVICE -p $SENTINEL_SERVICE_PORT sentinel get-master-addr-by-name {{ .Values.sentinel.masterSet }}"
        fi
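
For clarity, after that change the non-TLS branch would simply become (the TLS branch gets the same treatment):

            sentinel_info_command="{{- if and .Values.auth.enabled .Values.auth.sentinel }}REDISCLI_AUTH="\$REDIS_PASSWORD" {{ end }}redis-cli -h $REDIS_SERVICE -p $SENTINEL_SERVICE_PORT sentinel get-master-addr-by-name {{ .Values.sentinel.masterSet }}"
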
work4DevLMLOps commented 2 years ago

> Could you test removing the timeout 5 bit from these two lines (83 and 85)?
No luck:

127.0.0.1:26379> sentinel master mymaster
 1) "name"
 2) "mymaster"
 3) "ip"
 4) "redis-node-0.redis-headless.redis.svc.cluster.local"
 5) "port"
 6) "6379"
 7) "runid"
 8) "efdbedd3d138e20483f11ea302263fb005ef164f"
 9) "flags"
10) "master"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "158"
19) "last-ping-reply"
20) "158"
21) "down-after-milliseconds"
22) "60000"
23) "info-refresh"
24) "757"
25) "role-reported"
26) "master"
27) "role-reported-time"
28) "602794"
29) "config-epoch"
30) "0"
31) "num-slaves"
32) "0"
33) "num-other-sentinels"
34) "0"
35) "quorum"
36) "2"
37) "failover-timeout"
38) "18000"
39) "parallel-syncs"
40) "1"

127.0.0.1:26379> sentinel slaves mymaster
(empty array)
127.0.0.1:26379>

qeternity commented 2 years ago

Can confirm - just encountered this today as we restarted all nodes in a cluster at the same time.

migruiz4 commented 2 years ago

Hi,

I'm adding the 'on-hold' label so the stale-bot does not close this issue while we investigate this.

@ping2kpm, does the issue persist after the pods were deleted and recreated one by one? It would also help us if you could share the output of kubectl describe pod redis-node-0.

work4DevLMLOps commented 2 years ago

No, it doesn't persist: once the pods are deleted and recreated after the server reboot, the cluster forms again without any error.

Pod described:

[root@server1 ~]# kubectl describe -n redis pods redis-node-0
Name:         redis-node-0
Namespace:    redis
Priority:     0
Node:         server1.dev.wkelms.com/10.234.82.180
Start Time:   Fri, 04 Feb 2022 18:08:33 +0000
Labels:       app.kubernetes.io/component=node
              app.kubernetes.io/instance=redis
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=redis
              controller-revision-hash=redis-node-6f7cd45777
              helm.sh/chart=redis-16.1.0
              statefulset.kubernetes.io/pod-name=redis-node-0
Annotations:  checksum/configmap: e5c6cdd414b147061893ad54903d32b99a8a6918ffc7c4688e8e6fc88d205738
              checksum/health: 22f6a8a9b9adb4e72cc8a50f9fe62e0db9e2cca2e43657c039333c911a709210
              checksum/scripts: dc8c62f1901eba5a8368920144ed4ebcd2936094eec063cbbb9a4ddd70dcd279
              checksum/secret: db2bacaf687eeffa3ff3a3d1da2b2d69836b8e2045647ec7bacb2ffdc4a42b6b
              prometheus.io/port: 9121
              prometheus.io/scrape: true
Status:       Running
IP:           10.42.0.131
IPs:
  IP:           10.42.0.131
Controlled By:  StatefulSet/redis-node
Containers:
  redis:
    Container ID:  containerd://337728d497a8d180f7a21d3ca088a97e696f267c490555fa228ca37faa48ee38
    Image:         docker.io/bitnami/redis:6.2.6-debian-10-r103
    Image ID:      docker.io/bitnami/redis@sha256:3d6055b1addad726b590df6d75a538a64d29f0d44c0dcf39c855173c0a3eb2da
    Port:          6379/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
    Args:
      -c
      /opt/bitnami/scripts/start-scripts/start-node.sh
    State:          Running
      Started:      Mon, 07 Feb 2022 16:02:51 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Fri, 04 Feb 2022 18:08:34 +0000
      Finished:     Mon, 07 Feb 2022 16:01:52 +0000
    Ready:          True
    Restart Count:  1
    Liveness:       exec [sh -c /health/ping_liveness_local.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
    Readiness:      exec [sh -c /health/ping_readiness_local.sh 5] delay=20s timeout=1s period=5s #success=1 #failure=5
    Environment:
      BITNAMI_DEBUG:             false
      REDIS_MASTER_PORT_NUMBER:  6379
      ALLOW_EMPTY_PASSWORD:      no
      REDIS_PASSWORD:            <set to the key 'redis-password' in secret 'redis'>  Optional: false
      REDIS_MASTER_PASSWORD:     <set to the key 'redis-password' in secret 'redis'>  Optional: false
      REDIS_TLS_ENABLED:         no
      REDIS_PORT:                6379
      REDIS_DATA_DIR:            /data
    Mounts:
      /data from redis-data (rw)
      /health from health (rw)
      /opt/bitnami/redis/etc from redis-tmp-conf (rw)
      /opt/bitnami/redis/mounted-etc from config (rw)
      /opt/bitnami/scripts/start-scripts from start-scripts (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sc754 (ro)
  sentinel:
    Container ID:  containerd://e324bd7bc468903c2787a9de34204f34f200dee2ed32d76a32854e9b26457af9
    Image:         docker.io/bitnami/redis-sentinel:6.2.6-debian-10-r100
    Image ID:      docker.io/bitnami/redis-sentinel@sha256:af140136548ce0359e595ccd7b24a435b00549135ef77d38818601e2f17f90c7
    Port:          26379/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
    Args:
      -c
      /opt/bitnami/scripts/start-scripts/start-sentinel.sh
    State:          Running
      Started:      Mon, 07 Feb 2022 16:02:53 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Fri, 04 Feb 2022 18:08:34 +0000
      Finished:     Mon, 07 Feb 2022 16:01:53 +0000
    Ready:          True
    Restart Count:  1
    Liveness:       exec [sh -c /health/ping_sentinel.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
    Readiness:      exec [sh -c /health/ping_sentinel.sh 5] delay=20s timeout=1s period=5s #success=1 #failure=5
    Environment:
      BITNAMI_DEBUG:               false
      REDIS_PASSWORD:              <set to the key 'redis-password' in secret 'redis'>  Optional: false
      REDIS_SENTINEL_TLS_ENABLED:  no
      REDIS_SENTINEL_PORT:         26379
    Mounts:
      /data from redis-data (rw)
      /health from health (rw)
      /opt/bitnami/redis-sentinel/etc from sentinel-tmp-conf (rw)
      /opt/bitnami/redis-sentinel/mounted-etc from config (rw)
      /opt/bitnami/scripts/start-scripts from start-scripts (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sc754 (ro)
  metrics:
    Container ID:  containerd://29f87b012ebab2413aba1a9cb6209f1142aa7eda10e7a4e5f128c0105322417c
    Image:         docker.io/bitnami/redis-exporter:1.33.0-debian-10-r27
    Image ID:      docker.io/bitnami/redis-exporter@sha256:a828ccc45a0542cf6066bf7487d168acdabac829a79d6d3e1aa95ca19b1fcfa0
    Port:          9121/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -c
      if [[ -f '/secrets/redis-password' ]]; then
          export REDIS_PASSWORD=$(cat /secrets/redis-password)
      fi
      redis_exporter

    State:          Running
      Started:      Mon, 07 Feb 2022 16:03:01 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Fri, 04 Feb 2022 18:08:34 +0000
      Finished:     Mon, 07 Feb 2022 16:01:51 +0000
    Ready:          True
    Restart Count:  1
    Environment:
      REDIS_ALIAS:     redis
      REDIS_USER:      default
      REDIS_PASSWORD:  <set to the key 'redis-password' in secret 'redis'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sc754 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  redis-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  redis-data-redis-node-0
    ReadOnly:   false
  start-scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      redis-scripts
    Optional:  false
  health:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      redis-health
    Optional:  false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      redis-configuration
    Optional:  false
  sentinel-tmp-conf:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  redis-tmp-conf:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-sc754:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
[root@server1 ~]#
migruiz4 commented 2 years ago

Thank you for your feedback @ping2kpm, that may confirm my suspicion that the issue is caused by all pods starting simultaneously after Kubernetes is rebooted.

Since Redis is not a cloud-native application, the chart implements a method to choose a single master when using several replicas, but it requires that a master is already up when the slave nodes initialize. We may need to make some changes to that method to prevent this scenario.

work4DevLMLOps commented 2 years ago

You're welcome! For the time being, I have created a script to auto-fix it after a server reboot; it needs to run as a cronjob (see the scheduling sketch after the script).

    #!/bin/bash

    nodes=redis-node-0,redis-node-1,redis-node-2
    for i in ${nodes//,/ }
    do
        MASTER_STATUS=$(kubectl -n redis exec -it $i -c redis -- bash -c 'redis-cli --no-auth-warning --raw -h $HOSTNAME -p $REDIS_SERVICE_PORT_TCP_REDIS -a $REDIS_PASSWORD info|grep role |cut -c1-11|cut -c6-'|dos2unix)
        export MASTER_STATUS
        if [ "master" = "$MASTER_STATUS" ]; then
            echo "Found the $MASTER_STATUS & node name is $i, checking the replica count"
            REPLICAS_COUNT=$(kubectl -n redis exec -it $i -c redis -- bash -c 'redis-cli --no-auth-warning -p 6379 -a $REDIS_PASSWORD info|grep connected_slaves|cut -d: -f2|cut -c1-1'|dos2unix)
            export REPLICAS_COUNT
            if [ "$REPLICAS_COUNT" -lt 2 ]; then
                echo "The slave replica count is $REPLICAS_COUNT, proceeding to reform the cluster"
                kubectl -n redis delete pods redis-node-0; sleep 60
            else
                echo "The slave replica count is $REPLICAS_COUNT = 2, REDIS CLUSTER WORKING AS EXPECTED, no action required, exiting"
                exit 0
            fi
        else
            echo "It is a slave node and the node name is $i"
        fi
    done
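
A minimal cron entry for that (assuming the script above is saved to the hypothetical path /usr/local/bin/redis-reform-check.sh, is executable, and the host running cron has kubectl access to the cluster):

    # Run the check every 5 minutes and append the output to a log file.
    (crontab -l 2>/dev/null; echo '*/5 * * * * /usr/local/bin/redis-reform-check.sh >> /var/log/redis-reform-check.log 2>&1') | crontab -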

Thanks & Regards.

qeternity commented 2 years ago

Since Redis is not a cloud-native application, the chart implements a method to choose a single master when using several replicas, but requires that a master is already up when the slave nodes initialize. We may need to make some changes to that method to prevent this scenario.

I'm not really sure what this means, as cloud-native is just marketing fluff, but the way we do this in our homegrown setup is to use a native k8s lock when initializing a new leader. The followers that lost the leader race block until the leader is initialized. There are well-trodden k8s APIs for this.
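
For anyone curious about the general shape of that approach, here is a rough sketch of an init-container script that uses the atomicity of kubectl create as the lock. This is only an illustration of the idea, not qeternity's actual implementation; the ConfigMap name redis-master-lock is a placeholder, and the pod's service account needs RBAC permissions for configmaps and pods:

    # Try to take the lock: 'kubectl create' is atomic, so exactly one pod wins.
    if kubectl -n redis create configmap redis-master-lock \
            --from-literal=holder="$HOSTNAME" 2>/dev/null; then
        echo "lock acquired by $HOSTNAME: initializing as master"
    else
        # Lost the race: wait until the current holder is Ready before starting.
        holder=$(kubectl -n redis get configmap redis-master-lock -o jsonpath='{.data.holder}')
        echo "lock held by $holder: waiting for it to become Ready"
        kubectl -n redis wait --for=condition=Ready "pod/$holder" --timeout=600s
    fi
    # Note: cleaning up a stale lock after a full-cluster restart needs extra handling.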

I would strongly advise people not to use this in production, or at all. We began re-evaluating this chart, after having had issues with it previously, to see if things had improved. I have the utmost respect for the Bitnami team, but this chart has never been functional, and really should not be published.

migruiz4 commented 2 years ago

Hi @qeternity,

I'm sorry your experience with the Redis chart was not positive. Our team and the users who contribute to this chart, either by reporting issues or submitting PRs, try to continuously improve it.

Cloud-native may be used as marketing fluff most of the time, but in this case, I used it to refer to the issues we encounter because Redis is not designed to work in containerized environments.

To fix those issues, we have to create workarounds either in the chart or in the container logic, covering things such as manually updating the cluster and shard balance each time the cluster is restarted, and so on.

The impact of these design decisions is much smaller when working in a VM cluster, where operations are performed manually by the cluster administrator and some scenarios, such as network changes or the simultaneous restart of several VMs, are very infrequent.

Please be aware that these issues may exist when using the chart, and we appreciate you reporting them when possible. For those who like to get their hands dirty and would like to contribute, we will be very happy to review their PRs.

The feedback and contributions help us make this chart more stable, and of course, all the feedback shared with the Redis community will help replace custom workarounds with built-in features.

qeternity commented 2 years ago

Hi @migruiz4,

Thanks for the reply, and thanks for all of Bitnami's work in general (we use a few other charts that we are very happy with and grateful for, and certainly none that we feel entitled to). I have just pulled my hair out with this chart a handful of times, so if I seem exasperated, it's only because I would very much like to migrate our hand-rolled system to something like this.

Re: cloud native, totally fair and you're absolutely right: the Sentinel config is not conducive to service definitions and virtual IPs. Part of our general frustration with Redis is that HA and clustering were developed as afterthoughts, and that becomes painfully evident in these circumstances. Memcached stands head and shoulders above it in this regard.

I will definitely spend some more time putting this chart through its paces, and will hopefully have some fixes to upstream in the next month or so.

h0jeZvgoxFepBQ2C commented 2 years ago

@qeternity it seems we have the same problem, and it really surprised our team that this runs so unstably. Did you find a better solution for this whole class of problems?

qeternity commented 2 years ago

@h0jeZvgoxFepBQ2C we are pinned to an older version of the chart that we found to be stable, combined with an init container that takes a k8s lock to coordinate startups.

javsalgar commented 2 years ago

@qeternity could you share what the logic for the kubernetes lock is? We would like to evaluate it as it may make sense to add it to the chart, at least as an experimental feature.

javsalgar commented 2 years ago

Hi,

I created this PR https://github.com/bitnami/charts/pull/9282, which adds experimental support for persisting the sentinel.conf file. Thanks to this, the issue could be mitigated. I did not enable it by default, as we would like to get feedback from the community first.
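
Once the PR is released, enabling it would presumably look something like the following; the exact value name (sentinel.persistence.enabled here) is an assumption and should be checked against the PR and the chart README:

    helm upgrade redis bitnami/redis -n redis --reuse-values \
        --set sentinel.enabled=true \
        --set sentinel.persistence.enabled=true   # assumed flag introduced by PR 9282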

All input is appreciated!

qeternity commented 2 years ago

@javsalgar just had a look over the PR - I think this is a better approach than our init container lock. I will deploy it to our dev cluster for some testing.

h0jeZvgoxFepBQ2C commented 1 year ago

And how did it work out?