Closed ahjing99 closed 6 months ago
The pod cannot recover after injecting an IO fault
`kbcli fault io errno cluster-oqroov-redis-0 --ns-fault=default --volume-path=/data --errno=28 --duration=2m`
IOChaos io-chaos-jlk9n created
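For context on the `--errno=28` value used in the command above: on Linux, errno 28 is `ENOSPC`, whose standard message is "No space left on device". That is the same text that shows up later in the readiness probe events, so that message on its own may not distinguish a genuinely full disk from the injected fault. A minimal sketch to look up the mapping, assuming a Linux host with Python available (the helper name `describe_errno` is hypothetical):

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, OS message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 28 is the value passed via --errno in the kbcli command above.
name, message = describe_errno(28)
print(name, "-", message)
```

On Linux this should report `ENOSPC` together with the "No space left on device" message.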
k describe pod cluster-oqroov-redis-0
Name: cluster-oqroov-redis-0
Namespace: default
Priority: 0
Node: gke-yjtest-default-pool-47e27321-h6tl/10.128.15.203
Start Time: Wed, 13 Sep 2023 09:49:31 +0800
Labels: app.kubernetes.io/component=redis
app.kubernetes.io/instance=cluster-oqroov
app.kubernetes.io/managed-by=kubeblocks
app.kubernetes.io/name=redis
app.kubernetes.io/version=redis-7.0.6
apps.kubeblocks.io/component-name=redis
apps.kubeblocks.io/workload-type=Replication
controller-revision-hash=cluster-oqroov-redis-5b74946954
kubeblocks.io/role=secondary
rsm.workloads.kubeblocks.io/access-mode=Readonly
statefulset.kubernetes.io/pod-name=cluster-oqroov-redis-0
Annotations: apps.kubeblocks.io/component-replicas: 2
apps.kubeblocks.io/last-role-changed-event-timestamp: 2023-09-13T01:52:16Z
rs.apps.kubeblocks.io/primary: cluster-oqroov-redis-1
Status: Running
IP: 10.104.2.112
IPs:
IP: 10.104.2.112
Controlled By: StatefulSet/cluster-oqroov-redis
Init Containers:
role-agent-installer:
Container ID: containerd://84064adfa884ccf1e192ed9455fc691776d8982239605f52481b9f3f4369a73a
Image: msoap/shell2http:1.16.0
Image ID: docker.io/msoap/shell2http@sha256:a20bdde2f679de2cba6bf3d9f470489c7836d4d0d28232a2b295450809cd43ef
Port: <none>
Host Port: <none>
Command:
cp
/app/shell2http
/role-probe/agent
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 13 Sep 2023 09:49:40 +0800
Finished: Wed, 13 Sep 2023 09:49:40 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/role-probe from role-agent (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
Containers:
redis:
Container ID: containerd://5952fc658ed832810e3d1aa0c1c20dd803c58b116e37d6ccbd259bf1e8a718e2
Image: registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
Image ID: registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server@sha256:511808b267ab8d800283604ef5c01f4fe94792bfb746bb6dba236cc29ff5495b
Port: 6379/TCP
Host Port: 0/TCP
Command:
/scripts/redis-start.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 13 Sep 2023 10:12:44 +0800
Finished: Wed, 13 Sep 2023 10:12:44 +0800
Ready: False
Restart Count: 8
Limits:
cpu: 500m
memory: 1Gi
Requests:
cpu: 500m
memory: 1Gi
Readiness: exec [sh -c /scripts/redis-ping.sh 1] delay=10s timeout=1s period=5s #success=1 #failure=5
Environment Variables from:
cluster-oqroov-redis-env ConfigMap Optional: false
cluster-oqroov-redis-rsm-env ConfigMap Optional: false
Environment:
KB_POD_NAME: cluster-oqroov-redis-0 (v1:metadata.name)
KB_POD_UID: (v1:metadata.uid)
KB_NAMESPACE: default (v1:metadata.namespace)
KB_SA_NAME: (v1:spec.serviceAccountName)
KB_NODENAME: (v1:spec.nodeName)
KB_HOST_IP: (v1:status.hostIP)
KB_POD_IP: (v1:status.podIP)
KB_POD_IPS: (v1:status.podIPs)
KB_HOSTIP: (v1:status.hostIP)
KB_PODIP: (v1:status.podIP)
KB_PODIPS: (v1:status.podIPs)
KB_CLUSTER_NAME: cluster-oqroov
KB_COMP_NAME: redis
KB_CLUSTER_COMP_NAME: cluster-oqroov-redis
KB_CLUSTER_UID_POSTFIX_8: 01b62888
KB_POD_FQDN: $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
REDIS_REPL_USER: kbreplicator
REDIS_REPL_PASSWORD: <set to the key 'password' in secret 'cluster-oqroov-conn-credential'> Optional: false
REDIS_DEFAULT_PASSWORD: <set to the key 'password' in secret 'cluster-oqroov-conn-credential'> Optional: false
REDIS_SENTINEL_USER: $(REDIS_REPL_USER)-sentinel
REDIS_SENTINEL_PASSWORD: <set to the key 'password' in secret 'cluster-oqroov-conn-credential'> Optional: false
REDIS_ARGS: --requirepass $(REDIS_PASSWORD)
Mounts:
/data from data (rw)
/etc/conf from redis-config (rw)
/etc/redis from redis-conf (rw)
/kb-podinfo from pod-info (rw)
/scripts from scripts (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
metrics:
Container ID: containerd://14f7936053e3031c1968f5e6d000145f5019fa2f8f19c5689f7c7b2682a01f95
Image: registry.cn-hangzhou.aliyuncs.com/apecloud/agamotto:0.1.2-beta.1
Image ID: registry.cn-hangzhou.aliyuncs.com/apecloud/agamotto@sha256:cbab349b90490807a8d5039bf01bc7e37334f20c98c7dd75bc7fc4cf9e5b10ee
Port: 9121/TCP
Host Port: 0/TCP
Command:
/bin/agamotto
--config=/opt/conf/metrics-config.yaml
State: Running
Started: Wed, 13 Sep 2023 09:49:41 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 0
memory: 0
Requests:
cpu: 0
memory: 0
Environment Variables from:
cluster-oqroov-redis-env ConfigMap Optional: false
cluster-oqroov-redis-rsm-env ConfigMap Optional: false
Environment:
KB_POD_NAME: cluster-oqroov-redis-0 (v1:metadata.name)
KB_POD_UID: (v1:metadata.uid)
KB_NAMESPACE: default (v1:metadata.namespace)
KB_SA_NAME: (v1:spec.serviceAccountName)
KB_NODENAME: (v1:spec.nodeName)
KB_HOST_IP: (v1:status.hostIP)
KB_POD_IP: (v1:status.podIP)
KB_POD_IPS: (v1:status.podIPs)
KB_HOSTIP: (v1:status.hostIP)
KB_PODIP: (v1:status.podIP)
KB_PODIPS: (v1:status.podIPs)
KB_CLUSTER_NAME: cluster-oqroov
KB_COMP_NAME: redis
KB_CLUSTER_COMP_NAME: cluster-oqroov-redis
KB_CLUSTER_UID_POSTFIX_8: 01b62888
KB_POD_FQDN: $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
ENDPOINT: localhost:6379
REDIS_USER: <set to the key 'username' in secret 'cluster-oqroov-conn-credential'> Optional: false
REDIS_PASSWORD: <set to the key 'password' in secret 'cluster-oqroov-conn-credential'> Optional: false
Mounts:
/opt/conf from redis-metrics-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
kb-checkrole:
Container ID: containerd://80244c44cdfb15e6d5976aeceefc4b63c63c352cebc4a05335cbda301b2c99b0
Image: registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.7.0-alpha.8
Image ID: registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools@sha256:70fc1072a6bfd03a31e0fe83377487271e7443dc7f52d5bca1af0a203ba3b96e
Ports: 3501/TCP, 50001/TCP
Host Ports: 0/TCP, 0/TCP
Command:
lorry
--app-id
batch-sdk
--dapr-http-port
3501
--dapr-grpc-port
50001
--log-level
info
--config
/config/lorry/config.yaml
--components-path
/config/lorry/components
State: Running
Started: Wed, 13 Sep 2023 09:49:42 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 0
memory: 0
Requests:
cpu: 0
memory: 0
Readiness: http-get http://:3501/v1.0/bindings/redis%3Foperation=checkRole&workloadType=Replication delay=0s timeout=1s period=2s #success=1 #failure=2
Startup: tcp-socket :3501 delay=0s timeout=1s period=10s #success=1 #failure=3
Environment Variables from:
cluster-oqroov-redis-env ConfigMap Optional: false
cluster-oqroov-redis-rsm-env ConfigMap Optional: false
Environment:
KB_POD_NAME: cluster-oqroov-redis-0 (v1:metadata.name)
KB_POD_UID: (v1:metadata.uid)
KB_NAMESPACE: default (v1:metadata.namespace)
KB_SA_NAME: (v1:spec.serviceAccountName)
KB_NODENAME: (v1:spec.nodeName)
KB_HOST_IP: (v1:status.hostIP)
KB_POD_IP: (v1:status.podIP)
KB_POD_IPS: (v1:status.podIPs)
KB_HOSTIP: (v1:status.hostIP)
KB_PODIP: (v1:status.podIP)
KB_PODIPS: (v1:status.podIPs)
KB_CLUSTER_NAME: cluster-oqroov
KB_COMP_NAME: redis
KB_CLUSTER_COMP_NAME: cluster-oqroov-redis
KB_CLUSTER_UID_POSTFIX_8: 01b62888
KB_POD_FQDN: $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
KB_SERVICE_USER: <set to the key 'username' in secret 'cluster-oqroov-conn-credential'> Optional: false
KB_SERVICE_PASSWORD: <set to the key 'password' in secret 'cluster-oqroov-conn-credential'> Optional: false
KB_SERVICE_PORT: 6379
KB_SERVICE_ROLES: {}
KB_SERVICE_CHARACTER_TYPE: redis
KB_WORKLOAD_TYPE: Replication
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
action-0:
Container ID: containerd://307c8718dd83f39548ae51dca6b4e9ad09b02369b2861eeb0b54ba113a0311d8
Image: registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
Image ID: registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server@sha256:511808b267ab8d800283604ef5c01f4fe94792bfb746bb6dba236cc29ff5495b
Port: <none>
Host Port: <none>
Command:
/role-probe/agent
-port
36501
-export-all-vars
-form
/role
Role=$(redis-cli --user $KB_RSM_USERNAME --pass $KB_RSM_PASSWORD --no-auth-warning info | grep role | awk -F ':' '{print $2}' | tr '[:upper:]' '[:lower:]' | tr -d '\r' | tr -d '\n') && if [ "master" = "$Role" ]; then echo -n "primary"; else echo -n "secondary"; fi
State: Running
Started: Wed, 13 Sep 2023 09:49:42 +0800
Ready: True
Restart Count: 0
Environment:
KB_RSM_USERNAME: <set to the key 'username' in secret 'cluster-oqroov-conn-credential'> Optional: false
KB_RSM_PASSWORD: <set to the key 'password' in secret 'cluster-oqroov-conn-credential'> Optional: false
Mounts:
/role-probe from role-agent (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
role-observe:
Container ID: containerd://05fd2f2619f96d6de1dd775ecb9841a11f17362666de541032e23a11ed169947
Image: apecloud/kubeblocks-role-agent:latest
Image ID: docker.io/apecloud/kubeblocks-role-agent@sha256:094c90431b37fbdae13a85b491628fb05394f00de423a5686141ec63867181c2
Port: 7373/TCP
Host Port: 0/TCP
Command:
role-agent
--port
7373
State: Running
Started: Wed, 13 Sep 2023 09:49:42 +0800
Ready: True
Restart Count: 0
Readiness: exec [/bin/grpc_health_probe -addr=localhost:7373] delay=0s timeout=1s period=2s #success=1 #failure=2
Environment:
KB_RSM_USERNAME: <set to the key 'username' in secret 'cluster-oqroov-conn-credential'> Optional: false
KB_RSM_PASSWORD: <set to the key 'password' in secret 'cluster-oqroov-conn-credential'> Optional: false
KB_RSM_ACTION_SVC_LIST: [36501]
KB_SERVICE_USER: <set to the key 'username' in secret 'cluster-oqroov-conn-credential'> Optional: false
KB_SERVICE_PASSWORD: <set to the key 'password' in secret 'cluster-oqroov-conn-credential'> Optional: false
KB_RSM_SERVICE_PORT: 6379
KB_SERVICE_PORT: 6379
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-cluster-oqroov-redis-0
ReadOnly: false
pod-info:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels['kubeblocks.io/role'] -> pod-role
metadata.annotations['rs.apps.kubeblocks.io/primary'] -> primary-pod
metadata.annotations['apps.kubeblocks.io/component-replicas'] -> component-replicas
redis-metrics-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: cluster-oqroov-redis-redis-metrics-config
Optional: false
redis-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: cluster-oqroov-redis-redis-replication-config
Optional: false
scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: cluster-oqroov-redis-redis-scripts
Optional: false
redis-conf:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
role-agent:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-8lmrv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: kb-data=true:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25m default-scheduler Successfully assigned default/cluster-oqroov-redis-0 to gke-yjtest-default-pool-47e27321-h6tl
Normal SuccessfulAttachVolume 25m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-2782c97b-3626-4080-a161-f01910b4ce7c"
Normal Pulled 25m kubelet Container image "msoap/shell2http:1.16.0" already present on machine
Normal Created 25m kubelet Created container role-agent-installer
Normal Started 25m kubelet Started container role-agent-installer
Normal Created 25m kubelet Created container redis
Normal Pulled 25m kubelet Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8" already present on machine
Normal Started 25m kubelet Started container redis
Normal Pulled 25m kubelet Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/agamotto:0.1.2-beta.1" already present on machine
Normal Created 25m kubelet Created container metrics
Normal Started 25m kubelet Started container metrics
Normal Pulled 25m kubelet Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.7.0-alpha.8" already present on machine
Normal Created 25m kubelet Created container kb-checkrole
Warning Unhealthy 25m kubelet Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"","role":"primary"}
Normal Pulled 25m kubelet Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8" already present on machine
Normal Created 25m kubelet Created container action-0
Normal Started 25m kubelet Started container action-0
Normal Pulled 25m kubelet Container image "apecloud/kubeblocks-role-agent:latest" already present on machine
Normal Started 25m kubelet Started container kb-checkrole
Normal Started 25m kubelet Started container role-observe
Normal Created 25m kubelet Created container role-observe
Normal checkRole 24m sqlchannel {"event":"Failed","message":"role check delay","operation":"checkRole","originalRole":""}
Normal checkRole 24m sqlchannel {"event":"Success","operation":"checkRole","originalRole":"","role":"primary"}
Warning Unhealthy 23m kubelet Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"primary","role":"primary"}
Warning Unhealthy 22m (x5 over 23m) kubelet Readiness probe failed: MISCONF Errors writing to the AOF file: No space left on device
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Normal checkRole 22m sqlchannel {"event":"Success","operation":"checkRole","originalRole":"primary","role":"secondary"}
Normal checkRole 22m sqlchannel {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
Normal checkRole 22m sqlchannel {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
Normal checkRole 22m sqlchannel {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
Normal checkRole 21m sqlchannel {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
Normal checkRole 21m sqlchannel {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
Normal checkRole 21m sqlchannel {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
Normal checkRole 21m sqlchannel {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
Warning Unhealthy 18m (x2 over 20m) kubelet Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"secondary","role":"secondary"}
Normal checkRole 18m sqlchannel {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
Warning BackOff 10s (x91 over 18m) kubelet Back-off restarting failed container redis in pod cluster-oqroov-redis-0_default(d2c9746c-1aac-4301-99f8-2dc5682260c1)
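The `CrashLoopBackOff` state and the restart count of 8 in the output above reflect the kubelet's restart back-off: the delay between restarts starts at 10 seconds, doubles each time, and is capped at 5 minutes. A rough sketch of that schedule follows. The function name is hypothetical, and the defaults are taken from the documented kubelet behavior rather than from this cluster's configuration:

```python
def crashloop_backoff_delays(restarts, base=10, cap=300):
    """Approximate kubelet back-off delay (seconds) before each restart.

    Assumes the documented defaults: 10s initial delay, doubling on
    every restart, capped at 300s.
    """
    return [min(base * 2 ** i, cap) for i in range(restarts)]


delays = crashloop_backoff_delays(8)
print(delays)       # [10, 20, 40, 80, 160, 300, 300, 300]
print(sum(delays))  # 1210
```

Once the cap is reached, the container is only retried about every 5 minutes, so a pod in this state can take a while to come back even after the underlying cause is gone.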
k logs kubeblocks-645cc6c9bd-ppvbm >kblog.txt
Defaulted container "manager" out of: manager, tools (init), datascript (init)
kblog.txt
This issue has been marked as stale because it has been open for 30 days with no activity
This issue seems to be caused by a lack of disk space: "Warning Unhealthy 22m (x5 over 23m) kubelet Readiness probe failed: MISCONF Errors writing to the AOF file: No space left on device". The default disk size of 5G is too small.
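One way to check this diagnosis would be to look at the actual free space on the data volume. The sketch below uses Python's standard `shutil.disk_usage`. It assumes Python is available wherever the volume is mounted, which may not hold for the redis image, and the helper name `volume_usage` is hypothetical. `/data` is the mount path from the pod description above.

```python
import shutil


def volume_usage(path):
    """Return total/used/free bytes and the used fraction for a mount path."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "total": usage.total,
        "used": usage.used,
        "free": usage.free,
        "used_fraction": used_fraction,
    }


if __name__ == "__main__":
    # "." keeps the example runnable anywhere; in the pod the path would be /data.
    info = volume_usage(".")
    print(f"used {info['used_fraction']:.1%}, free {info['free']} bytes")
```

If the volume shows plenty of free space while Redis still reports "No space left on device", the error is more likely coming from the injected fault than from the 5G size.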
Closing
➜ ~ kbcli version
Kubernetes: v1.27.3-gke.100
KubeBlocks: 0.7.0-alpha.8
kbcli: 0.7.0-alpha.8