Hi,
Could you provide the logs of the pods as well? If you could launch it with image.debug=true, it would be helpful.
Containers are now started with image.debug=true
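For reference, the flag is set via the values file, e.g.:
image:
  debug: true   # sentinel.image.debug can presumably be enabled the same way if sentinel logs are needed too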
redis.log
05:26:37.74 INFO ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 15 redis-cli -h dos-redis.dos-ig1.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
05:26:52.74 INFO ==> Configuring the node as master
1:C 12 Sep 2022 05:26:52.749 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 12 Sep 2022 05:26:52.749 # Redis version=7.0.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 12 Sep 2022 05:26:52.749 # Configuration loaded
1:M 12 Sep 2022 05:26:52.750 * monotonic clock: POSIX clock_gettime
1:M 12 Sep 2022 05:26:52.751 # Warning: Could not create server TCP listening socket ::*:6379: unable to bind socket, errno: 97
1:M 12 Sep 2022 05:26:52.751 * Running mode=standalone, port=6379.
1:M 12 Sep 2022 05:26:52.751 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 12 Sep 2022 05:26:52.751 # Server initialized
1:M 12 Sep 2022 05:26:52.754 * Creating AOF base file appendonly.aof.1.base.rdb on server start
1:M 12 Sep 2022 05:26:52.768 * Creating AOF incr file appendonly.aof.1.incr.aof on server start
1:M 12 Sep 2022 05:26:52.768 * Ready to accept connections
1:signal-handler (1662960651) Received SIGTERM scheduling shutdown...
1:M 12 Sep 2022 05:30:51.167 # User requested shutdown...
1:M 12 Sep 2022 05:30:51.167 * Calling fsync() on the AOF file.
1:M 12 Sep 2022 05:30:51.167 # Redis is now ready to exit, bye bye...
sentinel.log
05:26:38.04 INFO ==> about to run the command: REDISCLI_AUTH=$REDIS_PASSWORD timeout 15 redis-cli -h dos-redis.dos-ig1.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
1:X 12 Sep 2022 05:26:53.071 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 12 Sep 2022 05:26:53.071 # Redis version=7.0.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 12 Sep 2022 05:26:53.071 # Configuration loaded
1:X 12 Sep 2022 05:26:53.071 * monotonic clock: POSIX clock_gettime
1:X 12 Sep 2022 05:26:53.072 # Warning: Could not create server TCP listening socket ::*:26379: unable to bind socket, errno: 97
1:X 12 Sep 2022 05:26:53.073 * Running mode=sentinel, port=26379.
1:X 12 Sep 2022 05:26:53.073 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:X 12 Sep 2022 05:26:53.073 # Sentinel ID is fdc7c0cc7f493c5102debdadc69243a114b56565
1:X 12 Sep 2022 05:26:53.073 # +monitor master mymaster dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local 6379 quorum 2
1:X 12 Sep 2022 05:31:01.284 # +sdown master mymaster dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local 6379
1:signal-handler (1662960681) Received SIGTERM scheduling shutdown...
1:X 12 Sep 2022 05:31:21.337 # User requested shutdown...
1:X 12 Sep 2022 05:31:21.337 # Sentinel is now ready to exit, bye bye...
metrics.log
time="2022-09-12T05:26:38Z" level=info msg="Redis Metrics Exporter v1.43.1 build date: 2022-08-13-05:09:29 sha1: 78f04312879f5585f307c8bec9354de8250e47e9 Go: go1.19 GOOS: linux GOARCH: amd64"
time="2022-09-12T05:26:38Z" level=info msg="Providing metrics at :9121/metrics"
time="2022-09-12T05:26:51Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:30:51Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:30:51Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:31:02Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:31:21Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:31:21Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
time="2022-09-12T05:31:32Z" level=error msg="Couldn't connect to redis instance (redis://localhost:6379)"
kubectl describe pod dos-redis-node-0
(minimized)
Containers:
redis:
Image: bitnami/redis:7.0.4-debian-11-r17
Port: 6379/TCP
Host Port: 0/TCP
Command:
/bin/bash
Args:
-c
/opt/bitnami/scripts/start-scripts/start-node.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Ready: False
Restart Count: 202
Liveness: exec [sh -c /health/ping_liveness_local.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
Readiness: exec [sh -c /health/ping_readiness_local.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=5
Startup: tcp-socket :redis delay=10s timeout=5s period=10s #success=1 #failure=22
sentinel:
Image: bitnami/redis-sentinel:7.0.4-debian-11-r14
Port: 26379/TCP
Host Port: 0/TCP
Command:
/bin/bash
Args:
-c
/opt/bitnami/scripts/start-scripts/start-sentinel.sh
State: Running
Last State: Terminated
Reason: Completed
Exit Code: 0
Ready: False
Restart Count: 202
Liveness: exec [sh -c /health/ping_sentinel.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
Readiness: exec [sh -c /health/ping_sentinel.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=5
Startup: tcp-socket :redis-sentinel delay=10s timeout=5s period=10s #success=1 #failure=22
metrics:
Image: bitnami/redis-exporter:1.43.1-debian-11-r4
Port: 9121/TCP
Host Port: 0/TCP
Command:
/bin/bash
-c
if [[ -f '/secrets/redis-password' ]]; then
export REDIS_PASSWORD=$(cat /secrets/redis-password)
fi
redis_exporter
State: Running
Ready: True
Restart Count: 0
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPreStopHook 48m (x196 over 23h) kubelet Exec lifecycle hook ([/bin/bash -c /opt/bitnami/scripts/start-scripts/prestop-redis.sh]) for Container "redis" in Pod "dos-redis-node-0_dos-dev(6b103f55-c323-4b1a-b137-9cae7ba6f081)" failed - error: command '/bin/bash -c /opt/bitnami/scripts/start-scripts/prestop-redis.sh' exited with 1: , message: "Waiting for sentinel to run failoverfor up to 20s\n"
Warning BackOff 43m (x2666 over 22h) kubelet Back-off restarting failed container
Warning Unhealthy 8m19s (x4442 over 23h) kubelet Startup probe failed: dial tcp 10.226.180.2:26379: i/o timeout
Warning Unhealthy 3m19s (x4466 over 23h) kubelet Startup probe failed: dial tcp 10.226.180.2:6379: i/o timeout
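To rule out the ports being generally unreachable inside the cluster (as opposed to unreachable for the kubelet specifically), a quick check from a pod the NetworkPolicy admits (e.g. one labelled dos-redis-client: "true", like the debug pod used further down) could look like this, with the pod IP taken from the events above:
timeout 2 bash -c 'cat < /dev/null > /dev/tcp/10.226.180.2/6379'  && echo '6379 open'  || echo '6379 closed'
timeout 2 bash -c 'cat < /dev/null > /dev/tcp/10.226.180.2/26379' && echo '26379 open' || echo '26379 closed'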
kubectl describe statefulsets.apps dos-redis-node
(minimized)
Containers:
redis:
Image: bitnami/redis:7.0.4-debian-11-r17
Port: 6379/TCP
Host Port: 0/TCP
Command:
/bin/bash
Args:
-c
/opt/bitnami/scripts/start-scripts/start-node.sh
Liveness: exec [sh -c /health/ping_liveness_local.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
Readiness: exec [sh -c /health/ping_readiness_local.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=5
Startup: tcp-socket :redis delay=10s timeout=5s period=10s #success=1 #failure=22
sentinel:
Image: bitnami/redis-sentinel:7.0.4-debian-11-r14
Port: 26379/TCP
Host Port: 0/TCP
Command:
/bin/bash
Args:
-c
/opt/bitnami/scripts/start-scripts/start-sentinel.sh
Liveness: exec [sh -c /health/ping_sentinel.sh 5] delay=20s timeout=5s period=5s #success=1 #failure=5
Readiness: exec [sh -c /health/ping_sentinel.sh 1] delay=20s timeout=1s period=5s #success=1 #failure=5
Startup: tcp-socket :redis-sentinel delay=10s timeout=5s period=10s #success=1 #failure=22
metrics:
Image: bitnami/redis-exporter:1.43.1-debian-11-r4
Port: 9121/TCP
Host Port: 0/TCP
Command:
/bin/bash
-c
if [[ -f '/secrets/redis-password' ]]; then
export REDIS_PASSWORD=$(cat /secrets/redis-password)
fi
redis_exporter
Volume Claims: <none>
Events: <none>
Hi @jgkirschbaum,
Could you please provide more information about your environment? What CNI are you using? Have you omitted any additional values from the values.yaml you provided?
I tried to reproduce the issue using Calico on a fresh install but I wasn't able to get the same behavior.
We are running the VMware TKGI distribution with k8s v1.23.7 on premises. TKGI uses VMware's NSX-T CNI. We also tested on k8s v1.21.9, with no success.
The only part omitted from the posted values is the following:
global:
  imageRegistry: "our-private-registry"
  imagePullSecrets: ["pullsecret"]
image:
  pullPolicy: "Always"
  debug: true
What I figured out is that the only difference in the deployment is the ingress: part of the generated NetworkPolicy, but that does not look suspicious to me.
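For context, with networkPolicy.allowExternal: false the generated ingress block looks roughly like this (reconstructed from the chart templates, so treat the exact labels as an approximation):
ingress:
  # Redis/Sentinel ports only accept traffic from pods carrying the client label or belonging to the release
  - ports:
      - port: 6379
      - port: 26379
    from:
      - podSelector:
          matchLabels:
            dos-redis-client: "true"
      - podSelector:
          matchLabels:
            app.kubernetes.io/instance: dos-redis
            app.kubernetes.io/name: redis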
It is very hard to determine what could be the root cause of the issue, so in case it helps with troubleshooting, I would like to share my thoughts about what could be the cause:
With networkPolicy.allowExternal: true the Redis chart is installed successfully, but with networkPolicy.allowExternal: false it is not. As you mentioned, the only difference caused by this value is the ingress section of the NetworkPolicy.
Warning Unhealthy 8m19s (x4442 over 23h) kubelet Startup probe failed: dial tcp 10.226.180.2:26379: i/o timeout
Warning Unhealthy 3m19s (x4466 over 23h) kubelet Startup probe failed: dial tcp 10.226.180.2:6379: i/o timeout
It would be super odd, but I'm thinking that maybe with the networkPolicy ingress restricted, Kubernetes is not able to properly execute the probes.
To discard this hypothesis, could you please try to install the chart with all probes disabled? Once installed, it would be helpful to know whether the pods start and form the Sentinel cluster.
That troubleshooting may give us some additional info on what could be the cause of the problem.
Starting Redis with all startup probes disabled does form a Redis Sentinel environment:
networkPolicy:
  allowExternal: false
replica:
  startupProbe:
    enabled: false
sentinel:
  startupProbe:
    enabled: false
If one of the two startup probes is not disabled, Redis Sentinel won't start up.
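To double-check which probes actually end up in the rendered StatefulSet, something like this can be used (chart reference as in the reproduce command at the bottom; the grep is just a quick filter):
helm template dos-redis redis --values redis.yaml | grep -n -A 4 startupProbe
kubectl get statefulset dos-redis-node -o jsonpath='{.spec.template.spec.containers[*].startupProbe}'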
Starting a debug container with the configuration
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: dos-redis-client
    dos-redis-client: "true"
  name: dos-redis-client
  namespace: dos-ig1
spec:
  imagePullSecrets:
    - name: pullsecret
  serviceAccountName: dos-redis
  terminationGracePeriodSeconds: 1
  containers:
    - image: bitnami/redis:7.0.4-debian-11-r17
      imagePullPolicy: Always
      name: dos-redis-client
      command:
        - sleep
      args:
        - infinity
      env:
        - name: REDISCLI_AUTH
          value: "secret"
      resources:
        limits:
          memory: 64Mi
        requests:
          cpu: 100m
          memory: 64Mi
      securityContext:
        allowPrivilegeEscalation: false
        runAsGroup: 10101
        runAsNonRoot: true
        runAsUser: 10101
  securityContext:
    fsGroup: 10101
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dos-redis-client
  labels:
    app.kubernetes.io/managed-by: kirscju
spec:
  egress:
    - ports:
        - port: 53
          protocol: UDP
    - ports:
        - port: 6379
          protocol: TCP
        - port: 26379
          protocol: TCP
      to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/instance: dos-redis
              app.kubernetes.io/name: redis
  podSelector:
    matchLabels:
      app: dos-redis-client
  policyTypes:
    - Egress
shows the following:
redis-cli -h dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local -p 6379 info replication
role:master
connected_slaves:2
slave0:ip=dos-redis-node-1.dos-redis-headless.dos-ig1.svc.cluster.local,port=6379,state=online,offset=237117,lag=1
slave1:ip=dos-redis-node-2.dos-redis-headless.dos-ig1.svc.cluster.local,port=6379,state=online,offset=237117,lag=1
redis-cli -h dos-redis-node-1|2.dos-redis-headless.dos-ig1.svc.cluster.local -p 6379 info replication
role:slave
master_host:dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_read_repl_offset:246189
slave_repl_offset:246189
slave_priority:100
slave_read_only:1
replica_announced:1
connected_slaves:0
master_failover_state:no-failover
redis-cli -h dos-redis-node-0|1|2.dos-redis-headless.dos-ig1.svc.cluster.local -p 26379 info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_tilt_since_seconds:-1
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=dos-redis-node-0.dos-redis-headless.dos-ig1.svc.cluster.local:6379,slaves=2,sentinels=3
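For completeness, the same lookup the start scripts perform (see the redis.log above) should also work from the debug pod and return the current master:
REDISCLI_AUTH=$REDIS_PASSWORD timeout 15 redis-cli -h dos-redis.dos-ig1.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster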
Hi @jgkirschbaum,
Thank you very much for the detailed response.
It is weird that enabling/disabling the startup probes makes the difference, in addition to the i/o timeout on the tcpSocket probes from the previous pod describe output:
Warning Unhealthy 8m19s (x4442 over 23h) kubelet Startup probe failed: dial tcp 10.226.180.2:26379: i/o timeout
Warning Unhealthy 3m19s (x4466 over 23h) kubelet Startup probe failed: dial tcp 10.226.180.2:6379: i/o timeout
I'm not sure if this could be caused by the chart itself or by an issue with the CNI plugin, as the probe requests shouldn't be blocked.
I suggest trying replica/sentinel.customStartupProbe to set a different probe condition. Meanwhile, I will discuss this issue internally.
@migruiz4 thank you for your effort, your feedback and your thoughts regarding debugging the startup issue.
Meanwhile I did some further investigation and I'd like to share my results. As you mentioned, I tried it with replica|sentinel.customStartupProbe and the following quick hack as configuration:
replica:
  startupProbe:
    enabled: false
  customStartupProbe:
    failureThreshold: 22
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 5
    exec:
      command:
        - "bash"
        - "-c"
        - "REDISCLI_AUTH=secret timeout 1 redis-cli -h 127.0.0.1 -p 6379 ping"
sentinel:
  startupProbe:
    enabled: false
  customStartupProbe:
    failureThreshold: 22
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 5
    exec:
      command:
        - "bash"
        - "-c"
        - "REDISCLI_AUTH=secret timeout 1 redis-cli -h 127.0.0.1 -p 26379 ping"
Everything works like a charm: all pods start up and form a Redis Sentinel environment.
Replacing the exec part of the startup probe with what the Helm chart generates, i.e.
tcpSocket:
  port: 6379|26379
the pods won't start up.
In the k8s pod configuration documentation and the k8s pod lifecycle documentation I found:
For a TCP probe, the kubelet makes the probe connection at the node, not in the pod, which means that you can not use a service name in the host parameter since the kubelet is unable to resolve it.
When a Container lifecycle management hook is called, the Kubernetes management system executes the handler according to the hook action, httpGet and tcpSocket are executed by the kubelet process, and exec is executed in the container.
It's a bit unclear to me, but could it be that the network policy generated by the Helm chart with networkPolicy.allowExternal: false doesn't allow the kubelet to connect to the pod, i.e. the containers?
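If that is the case, one way to test the theory without touching the probes would be an extra ingress rule (added to the generated policy or as a separate NetworkPolicy) that admits the network the kubelet probes originate from; the CIDR below is purely hypothetical and would have to match the actual TKGI/NSX-T node or SNAT range:
- from:
    - ipBlock:
        cidr: 10.226.0.0/16   # hypothetical node/SNAT range, adjust to your environment
  ports:
    - port: 6379
      protocol: TCP
    - port: 26379
      protocol: TCP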
Additional information.
I've set up a stripped-down scenario with an nginx pod and the corresponding network policy, which looks like this:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  labels:
    app.kubernetes.io/instance: dos-redis-proof
    app.kubernetes.io/name: redis-proof
  name: dos-redis-proof
spec:
  egress:
    - ports:
        - port: 53
          protocol: UDP
  ingress:
    - from:
        - podSelector:
            matchLabels:
              dos-redis-client: "true"
        - podSelector:
            matchLabels:
              app.kubernetes.io/instance: dos-redis-proof
              app.kubernetes.io/name: redis-proof
      ports:
        - port: 8080
          protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: dos-redis-proof
      app.kubernetes.io/name: redis-proof
  policyTypes:
    - Ingress
    - Egress
and for the pod
apiVersion: v1
kind: Pod
metadata:
  name: dos-redis-proof
  labels:
    app: dos-redis-proof
    app.kubernetes.io/instance: dos-redis-proof
    app.kubernetes.io/name: redis-proof
spec:
  imagePullSecrets:
    - name: pullsecret
  serviceAccountName: dos-redis-proof
  terminationGracePeriodSeconds: 1
  containers:
    - image: nginxinc/nginx-unprivileged:1.23.1
      imagePullPolicy: Always
      name: dos-redis-proof
      resources:
        limits:
          memory: 64Mi
        requests:
          cpu: 100m
          memory: 64Mi
      startupProbe:
        exec:
          command:
            - bash
            - -c
            - curl -sSo /dev/null http://localhost:8080/
      securityContext:
        allowPrivilegeEscalation: false
        runAsGroup: 83001
        runAsNonRoot: true
        runAsUser: 31127
  securityContext:
    fsGroup: 83001
Everything works as expected.
Changing the startup probe to
startupProbe:
  tcpSocket:
    port: 8080
the pod never starts up.
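The failing probe shows up directly in the pod events, e.g.:
kubectl get events --field-selector involvedObject.name=dos-redis-proof --sort-by=.lastTimestamp
which should report the same kind of "Startup probe failed: ... i/o timeout" messages as seen for the Redis pods above.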
I draw the conclusion that the k8s pod configuration documentation and the k8s pod lifecycle documentation are right:
For a TCP probe, the kubelet makes the probe connection at the node, not in the pod, which means that you can not use a service name in the host parameter since the kubelet is unable to resolve it.
When a Container lifecycle management hook is called, the Kubernetes management system executes the handler according to the hook action, httpGet and tcpSocket are executed by the kubelet process, and exec is executed in the container.
and that the kubelet therefore can't reach the pod because of the network policy.
Seems to be fixed in #12428
Thank you very much for your patience in troubleshooting and detailed responses that helped us resolve this issue.
The latest version of the bitnami/redis chart already includes the fix, replacing the tcpSocket startup probes.
Could you please confirm it is working without workarounds before we close this issue?
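Picking up the fixed chart should just be an in-place upgrade, e.g. (chart reference as in the reproduce command at the bottom; any customStartupProbe workaround removed from the values first):
helm upgrade --install dos-redis redis --version 17.2.0 --values redis.yaml   # 17.2.0 or later contains the fix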
Version 17.2.0 works like a charm. Thank you for the fix.
Thank you for your feedback!
Name and Version
bitnami/redis 17.1.4
What steps will reproduce the bug?
helm upgrade --install dos-redis redis --values redis.yaml
Are you using any custom parameters or values?
redis.yaml
What is the expected behavior?
Redis should form a 3-node HA configuration.
What do you see instead?
The first pod never starts and hangs forever. It's the sentinel container.
Additional information
When changing the networkPolicy part of the redis.yaml file, everything works fine.
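The exact values were not included in the report; judging from the discussion above, the change that makes it work presumably boils down to re-allowing external access, e.g.:
networkPolicy:
  allowExternal: true   # the failing setup had this set to false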