
[bitnami/redis] Redis Sentinel won't startup when `networkPolicy.allowExternal: false` is set #12337

Closed: jgkirschbaum closed this issue 2 years ago

jgkirschbaum commented 2 years ago

Name and Version

bitnami/redis 17.1.4

What steps will reproduce the bug?

  1. helm upgrade --install dos-redis redis --values redis.yaml
  2. Watch the result:
kubectl get pod
NAME               READY   STATUS    RESTARTS        AGE
dos-redis-node-0   1/3     Running   8 (2m58s ago)   21m
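
The logs of the failing containers can be inspected with, for example:

kubectl logs dos-redis-node-0 -c sentinel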

Are you using any custom parameters or values?

redis.yaml

networkPolicy:
  enabled: true
  allowExternal: false   # see the note after this values file

auth:
  password: "secret"

replica:
  podLabels:
    app: "redis"
  resources:
    limits:
      memory: "256Mi"
    requests:
      cpu: "0.2"
      memory: "256Mi"
  podSecurityContext:
    enabled: true
    fsGroup: 10101
  containerSecurityContext:
    enabled: true
    runAsUser: 10101
    runAsGroup: 10101
    allowPrivilegeEscalation: false
    runAsNonRoot: true
  persistence:
    enabled: false

sentinel:
  enabled: true
  image:
    pullPolicy: "Always"
  getMasterTimeout: 15
  downAfterMilliseconds: 10000
  failoverTimeout: 60000
  resources:
    limits:
      memory: "64Mi"
    requests:
      cpu: "0.1"
      memory: "64Mi"
  containerSecurityContext:
    enabled: true
    runAsUser: 10101
    runAsGroup: 10101
    allowPrivilegeEscalation: false
    runAsNonRoot: true

metrics:
  enabled: true
  image:
    pullPolicy: "Always"
  containerSecurityContext:
    enabled: true
    runAsUser: 10101
    runAsGroup: 10101
    allowPrivilegeEscalation: false
    runAsNonRoot: true
  resources:
    limits:
      memory: "64Mi"
    requests:
      cpu: "0.1"
      memory: "64Mi"
  serviceMonitor:
    enabled: true
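
Note: with networkPolicy.allowExternal: false, the chart's generated NetworkPolicy admits ingress only from pods carrying a client label derived from the release name, here dos-redis-client: "true" (this is why the debug client pod later in this thread carries that label). A minimal consumer pod label sketch:

metadata:
  labels:
    dos-redis-client: "true"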

What is the expected behavior?

Redis should form a 3-node HA-configuration.

What do you see instead?

The first pod never starts and hangs forever; it is the sentinel container that fails.

kubectl get pod
NAME               READY   STATUS    RESTARTS      AGE
dos-redis-node-0   1/3     Running   6 (45s ago)   15m

Additional information

When changing the networkPolicy part of redis.yaml file to

networkPolicy:
  enabled: true
  allowExternal: true

everything works fine.
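
For reference, the generated policy in each case can be dumped and compared with something like the following (the label selector is assumed from the standard Bitnami chart labels):

kubectl get networkpolicy -l app.kubernetes.io/instance=dos-redis -o yaml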

javsalgar commented 2 years ago

Hi,

Could you provide the logs of the pods as well? If you could launch it with image.debug=true, that would be helpful.

jgkirschbaum commented 2 years ago

The containers are now started with image.debug=true:

redis.log

 05:26:37.74 INFO  ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 15 redis-cli -h dos-redis.dos-ig1.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
 05:26:52.74 INFO  ==> Configuring the node as master
1:C 12 Sep 2022 05:26:52.749 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 12 Sep 2022 05:26:52.749 # Redis version=7.0.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 12 Sep 2022 05:26:52.749 # Configuration loaded
1:M 12 Sep 2022 05:26:52.750 * monotonic clock: POSIX clock_gettime
1:M 12 Sep 2022 05:26:52.751 # Warning: Could not create server TCP listening socket ::*:6379: unable to bind socket, errno: 97
1:M 12 Sep 2022 05:26:52.751 * Running mode=standalone, port=6379.
1:M 12 Sep 2022 05:26:52.751 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 12 Sep 2022 05:26:52.751 # Server initialized
1:M 12 Sep 2022 05:26:52.754 * Creating AOF base file appendonly.aof.1.base.rdb on server start
1:M 12 Sep 2022 05:26:52.768 * Creating AOF incr file appendonly.aof.1.incr.aof on server start
1:M 12 Sep 2022 05:26:52.768 * Ready to accept connections
1:signal-handler (1662960651) Received SIGTERM scheduling shutdown...
1:M 12 Sep 2022 05:30:51.167 # User requested shutdown...
1:M 12 Sep 2022 05:30:51.167 * Calling fsync() on the AOF file.
1:M 12 Sep 2022 05:30:51.167 # Redis is now ready to exit, bye bye...

sentinel.log

 05:26:38.04 INFO  ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 15 redis-cli -h dos-redis.dos-ig1.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
1:X 12 Sep 2022 05:26:53.071 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 12 Sep 2022 05:26:53.071 # Redis version=7.0.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 12 Sep 2022 05:26:53.071 # Configuration loaded
1:X 12 Sep 2022 05:26:53.071 * monotonic clock: POSIX clock_gettime
1:X 12 Sep 2022 05:26:53.072 # Warning: Could not create server TCP listening socket ::*:26379: unable to bind socket, errno: 97
1:X 12 Sep 2022 05:26:53.073 * Running mode=sentinel, port=26379.
1:X 12 Sep 2022 05:26:53.073 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 12 Sep 2022 05:26:53.073 # Sentinel ID is fdc7c0cc7f493c5102debdadc69243a114b56565
1:X 12 Sep 2022 05:26:53.073 # +monitor master mymaster dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local 6379 quorum 2
1:X 12 Sep 2022 05:31:01.284 # +sdown master mymaster dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local 6379
1:signal-handler (1662960681) Received SIGTERM scheduling shutdown...
1:X 12 Sep 2022 05:31:21.337 # User requested shutdown...
1:X 12 Sep 2022 05:31:21.337 # Sentinel is now ready to exit, bye bye...

metrics.log

time="2022-09-12T05:26:38Z" level=info msg="Redis Metrics Exporter v1.43.1    build date: 2022-08-13-05:09:29    sha1: 78f04312879f5585f307c8bec9354de8250e47e9    Go: go1.19    GOOS: linux    GOARCH: amd64"
time="2022-09-12T05:26:38Z" level=info msg="Providing metrics at :9121/metrics"
time="2022-09-12T05:26:51Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:30:51Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:30:51Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:31:02Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:31:21Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:31:21Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:31:32Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"

kubectl describe pod dos-redis-node-0 (minimized)

Containers:
  redis:
    Image:         bitnami/redis:7.0.4-debian-11-r17
    Port:          6379/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
    Args:
      -c
      /opt/bitnami/scripts/start-scripts/start-node.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
    Ready:          False
    Restart Count:  202
    Liveness:   exec [sh -c /health/ping_liveness_local.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
    Readiness:  exec [sh -c /health/ping_readiness_local.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=5
    Startup:    tcp-socket :redis delay=10s timeout=5s period=10s #success=1 #failure=22
  sentinel:
    Image:         bitnami/redis-sentinel:7.0.4-debian-11-r14
    Port:          26379/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
    Args:
      -c
      /opt/bitnami/scripts/start-scripts/start-sentinel.sh
    State:          Running
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
    Ready:          False
    Restart Count:  202
    Liveness:   exec [sh -c /health/ping_sentinel.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
    Readiness:  exec [sh -c /health/ping_sentinel.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=5
    Startup:    tcp-socket :redis-sentinel delay=10s timeout=5s period=10s #success=1 #failure=22
  metrics:
    Image:         bitnami/redis-exporter:1.43.1-debian-11-r4
    Port:          9121/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -c
      if [[ -f '/secrets/redis-password' ]]; then
          export REDIS_PASSWORD=$(cat /secrets/redis-password)
      fi
      redis_exporter

    State:          Running
    Ready:          True
    Restart Count:  0
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Events:
  Type     Reason             Age                     From     Message
  ----     ------             ----                    ----     -------
  Warning  FailedPreStopHook  48m (x196 over 23h)     kubelet  Exec lifecycle hook ([/bin/bash -c /opt/bitnami/scripts/start-scripts/prestop-redis.sh]) for Container "redis" in Pod "dos-redis-node-0_dos-dev(6b103f55-c323-4b1a-b137-9cae7ba6f081)" failed - error: command '/bin/bash -c /opt/bitnami/scripts/start-scripts/prestop-redis.sh' exited with 1: , message: "Waiting for sentinel to run failoverfor up to 20s\n"
  Warning  BackOff            43m (x2666 over 22h)    kubelet  Back-off restarting failed container
  Warning  Unhealthy          8m19s (x4442 over 23h)  kubelet  Startup probe failed: dial tcp 10.226.180.2:26379: i/o timeout
  Warning  Unhealthy          3m19s (x4466 over 23h)  kubelet  Startup probe failed: dial tcp 10.226.180.2:6379: i/o timeout

kubectl describe statefulsets.apps dos-redis-node (minimized)

  Containers:
   redis:
    Image:      bitnami/redis:7.0.4-debian-11-r17
    Port:       6379/TCP
    Host Port:  0/TCP
    Command:
      /bin/bash
    Args:
      -c
      /opt/bitnami/scripts/start-scripts/start-node.sh
    Liveness:   exec [sh -c /health/ping_liveness_local.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
    Readiness:  exec [sh -c /health/ping_readiness_local.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=5
    Startup:    tcp-socket :redis delay=10s timeout=5s period=10s #success=1 #failure=22
   sentinel:
    Image:      bitnami/redis-sentinel:7.0.4-debian-11-r14
    Port:       26379/TCP
    Host Port:  0/TCP
    Command:
      /bin/bash
    Args:
      -c
      /opt/bitnami/scripts/start-scripts/start-sentinel.sh
    Liveness:   exec [sh -c /health/ping_sentinel.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
    Readiness:  exec [sh -c /health/ping_sentinel.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=5
    Startup:    tcp-socket :redis-sentinel delay=10s timeout=5s period=10s #success=1 #failure=22
   metrics:
    Image:      bitnami/redis-exporter:1.43.1-debian-11-r4
    Port:       9121/TCP
    Host Port:  0/TCP
    Command:
      /bin/bash
      -c
      if [[ -f '/secrets/redis-password' ]]; then
          export REDIS_PASSWORD=$(cat /secrets/redis-password)
      fi
      redis_exporter
Volume Claims:  <none>
Events:         <none>

migruiz4 commented 2 years ago

Hi @jgkirschbaum,

Could you please provide more information about your environment? What CNI are you using? Have you omitted any additional value from the values.yaml provided?

I tried to reproduce the issue using Calico on a fresh install but I wasn't able to get the same behavior.

jgkirschbaum commented 2 years ago

We are running the VMware TKGI distribution with k8s v1.23.7 on premises. TKGI uses VMware's NSX-T CNI. I also tested on k8s v1.21.9, with the same failure.

The only part omitted was the following:

global:
  imageRegistry: "our-private-registry"
  imagePullSecrets: ["pullsecret"]

image:
  pullPolicy: "Always"
  debug: true

jgkirschbaum commented 2 years ago

What I figured out is that the only difference in the deployment is the ingress: section of the generated NetworkPolicy, but this did not seem suspicious to me.
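
For context, the ingress: section that allowExternal: false adds looks roughly like this (a sketch reconstructed from the chart's template and the client policy shown later in this thread; the exact rendered output may differ):

ingress:
  # Only pods carrying the client label, plus the Redis pods themselves, may connect.
  - from:
      - podSelector:
          matchLabels:
            dos-redis-client: "true"
      - podSelector:
          matchLabels:
            app.kubernetes.io/instance: dos-redis
            app.kubernetes.io/name: redis
    ports:
      - port: 6379
        protocol: TCP
      - port: 26379
        protocol: TCP

Traffic that does not originate from a pod matching one of these selectors is dropped.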

migruiz4 commented 2 years ago

It is very hard to determine the root cause of this issue, so in case it helps with troubleshooting, I would like to share my thoughts on what might be causing it:

It would be odd, but perhaps with the NetworkPolicy ingress restricted, Kubernetes is not able to execute the probes properly.

To rule out this hypothesis, could you please try installing the chart with all probes disabled and share the results?

That troubleshooting may give us some additional information about the cause of the problem.

jgkirschbaum commented 2 years ago

Starting Redis with all startup probes disabled does form a Redis Sentinel environment:

networkPolicy:
  allowExternal: false
replica:
  startupProbe:
    enabled: false
sentinel:
  startupProbe:
    enabled: false

If one of the two startup probes is not disabled, Redis Sentinel won't start up.

Starting a debug container with the configuration

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: dos-redis-client
    dos-redis-client: "true"
  name: dos-redis-client
  namespace: dos-ig1
spec:
  imagePullSecrets:
    - name: pullsecret
  serviceAccountName: dos-redis
  terminationGracePeriodSeconds: 1
  containers:
    - image: bitnami/redis:7.0.4-debian-11-r17
      imagePullPolicy: Always
      name: dos-redis-client
      command:
        - sleep
      args:
        - infinity
      env:
        - name: REDISCLI_AUTH
          value: "secret"
      resources:
        limits:
          memory: 64Mi
        requests:
          cpu: 100m
          memory: 64Mi
      securityContext:
        allowPrivilegeEscalation: false
        runAsGroup: 10101
        runAsNonRoot: true
        runAsUser: 10101
  securityContext:
    fsGroup: 10101
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dos-redis-client
  labels:
    app.kubernetes.io/managed-by: kirscju
spec:
  egress:
    - ports:
        - port: 53
          protocol: UDP
    - ports:
        - port: 6379
          protocol: TCP
        - port: 26379
          protocol: TCP
      to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/instance: dos-redis
              app.kubernetes.io/name: redis
  podSelector:
    matchLabels:
      app: dos-redis-client
  policyTypes:
    - Egress

shows the following:

redis-cli -h dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local -p 6379 info replication

role:master
connected_slaves:2
slave0:ip=dos-redis-node-1.dos-redis-headless.dos-ig1.svc.cluster.local,port=6379,state=online,offset=237117,lag=1
slave1:ip=dos-redis-node-2.dos-redis-headless.dos-ig1.svc.cluster.local,port=6379,state=online,offset=237117,lag=1

redis-cli -h dos-redis-node-1|2.dos-redis-headless.dos-ig1.svc.cluster.local -p 6379 info replication

role:slave
master_host:dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_read_repl_offset:246189
slave_repl_offset:246189
slave_priority:100
slave_read_only:1
replica_announced:1
connected_slaves:0
master_failover_state:no-failover

redis-cli -h dos-redis-node-0|1|2.dos-redis-headless.dos-ig1.svc.cluster.local -p 26379 info sentinel

# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_tilt_since_seconds:-1
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local:6379,slaves=2,sentinels=3
migruiz4 commented 2 years ago

Hi @jgkirschbaum,

Thank you very much for the detailed response.

It is weird that enabling/disabling the probes changes the behavior, especially given the i/o timeouts on the tcpSocket probes in the previous pod describe output:

Warning  Unhealthy  8m19s (x4442 over 23h)  kubelet  Startup probe failed: dial tcp 10.226.180.2:26379: i/o timeout
Warning  Unhealthy  3m19s (x4466 over 23h)  kubelet  Startup probe failed: dial tcp 10.226.180.2:6379: i/o timeout

I'm not sure whether this is caused by the chart itself or by an issue with the CNI plugin, as the probe requests shouldn't be blocked. I suggest trying redis/sentinel.customStartupProbe to set a different probe condition. Meanwhile, I will discuss this issue internally.

jgkirschbaum commented 2 years ago

@migruiz4 thank you for your effort, your feedback and your thoughts regarding debugging the startup issue.

Meanwhile, I did some further investigation and I'd like to share my results. As you suggested, I tried redis|sentinel.customStartupProbe with the following quick hack as configuration:

replica:
  startupProbe:
    enabled: false
  customStartupProbe:
    failureThreshold: 22
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 5
    exec:
      command:
        - "bash"
        - "-c"
        - "REDISCLI_AUTH=secret timeout 1 redis-cli -h 127.0.0.1 -p 6379 ping"

sentinel:
  startupProbe:
    enabled: false
  customStartupProbe:
    failureThreshold: 22
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 5
    exec:
      command:
        - "bash"
        - "-c"
        - "REDISCLI_AUTH=secret timeout 1 redis-cli -h 127.0.0.1 -p 26379 ping"

Everything works like a charm: all pods start up and form a Redis Sentinel environment.

Replacing the exec part of the startup probe with the tcpSocket check the Helm chart uses,

tcpSocket:
  port: 6379|26379

the pods won't start up.
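
For reference, the full tcpSocket startup probe rendered by the chart can be reconstructed from the kubectl describe output above (the named ports redis and redis-sentinel map to 6379 and 26379):

startupProbe:
  tcpSocket:
    port: redis            # redis-sentinel for the sentinel container
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 22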

In the k8s pod configuration documentation and k8s pod lifecycle documentation I found:

For a TCP probe, the kubelet makes the probe connection at the node, not in the pod, which means that you can not use a service name in the host parameter since the kubelet is unable to resolve it.

When a Container lifecycle management hook is called, the Kubernetes management system executes the handler according to the hook action, httpGet and tcpSocket are executed by the kubelet process, and exec is executed in the container.

It's a bit unclear to me, but could it be that the NetworkPolicy generated by the Helm chart with networkPolicy.allowExternal: false doesn't allow the kubelet to connect to the pod, i.e. its containers?

jgkirschbaum commented 2 years ago

Additional information.

I've set up a stripped-down scenario with an nginx pod and the corresponding network policies, which look like this:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/instance: dos-redis-proof
    app.kubernetes.io/name: redis-proof
  name: dos-redis-proof
spec:
  egress:
    - ports:
        - port: 53
          protocol: UDP
  ingress:
    - from:
        - podSelector:
            matchLabels:
              dos-redis-client: "true"
        - podSelector:
            matchLabels:
              app.kubernetes.io/instance: dos-redis-proof
              app.kubernetes.io/name: redis-proof
      ports:
        - port: 8080
          protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: dos-redis-proof
      app.kubernetes.io/name: redis-proof
  policyTypes:
    - Ingress
    - Egress

and for the pod

apiVersion: v1
kind: Pod
metadata:
  name: dos-redis-proof
  labels:
    app: dos-redis-proof
    app.kubernetes.io/instance: dos-redis-proof
    app.kubernetes.io/name: redis-proof
spec:
  imagePullSecrets:
    - name: pullsecret
  serviceAccountName: dos-redis-proof
  terminationGracePeriodSeconds: 1
  containers:
    - image: nginxinc/nginx-unprivileged:1.23.1
      imagePullPolicy: Always
      name: dos-redis-proof
      resources:
        limits:
          memory: 64Mi
        requests:
          cpu: 100m
          memory: 64Mi
      startupProbe:
        exec:
          command:
            - bash
            - -c
            - curl -sSo /dev/null http://localhost:8080/
      securityContext:
        allowPrivilegeEscalation: false
        runAsGroup: 83001
        runAsNonRoot: true
        runAsUser: 31127
  securityContext:
    fsGroup: 83001

Everything works as expected.

Changing the startup probe to

      startupProbe:
        tcpSocket:
          port: 8080

the pod will never start up.

I draw the conclusion that the k8s pod configuration documentation and the k8s pod lifecycle documentation are right

For a TCP probe, the kubelet makes the probe connection at the node, not in the pod, which means that you can not use a service name in the host parameter since the kubelet is unable to resolve it.

When a Container lifecycle management hook is called, the Kubernetes management system executes the handler according to the hook action, httpGet and tcpSocket are executed by the kubelet process, and exec is executed in the container.

so the kubelet can't reach the pod because of the NetworkPolicy.
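
A generic workaround, independent of the chart and assuming the nodes' address range is known, would be to additionally admit the node CIDR via an ipBlock rule, since kubelet probes originate from the node:

ingress:
  - from:
      - ipBlock:
          cidr: 10.226.0.0/16   # hypothetical node CIDR, adjust to your cluster
    ports:
      - port: 6379
        protocol: TCP
      - port: 26379
        protocol: TCP

Whether kubelet-originated probe traffic is subject to NetworkPolicy at all appears to depend on the CNI, which would also explain why the issue did not reproduce on Calico.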

jgkirschbaum commented 2 years ago

This seems to be fixed in #12428.

migruiz4 commented 2 years ago

Thank you very much for your patience in troubleshooting and detailed responses that helped us resolve this issue.

The latest version of the bitnami/redis chart already includes the fix that replaces the tcpSocket probes. Could you please confirm it works without workarounds before we close this issue?

jgkirschbaum commented 2 years ago

Version 17.2.0 works like a charm. Thank you for the fix.

migruiz4 commented 2 years ago

Thank you for your feedback!