faradawn closed this issue 2 years ago.
For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-31327. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.
Thanks for opening this issue and digging into CORTX on CRI-O! Most of our work has been done with Docker as the underlying container runtime, but that shouldn't be a major difference from the Kubernetes point of view. More than likely we are hitting an issue in the setup of the cluster or other assumptions that are going on with some pre-reqs. As it stands, it looks like the Rancher local-path-provisioner that we use for dynamic local provisioning is failing. So we'll start there to debug...
In your run of the prereq-deploy-cortx-cloud.sh script, you are documenting that you pass /dev/sda to the script as the disk you want to use for file-system storage. The prereq script creates /mnt/fs-local-volume/local-path-provisioner and mounts /mnt/fs-local-volume to the device passed in to the script (i.e. your /dev/sda parameter). Is /dev/sda otherwise in use on those nodes? Also, can you run kubectl get pods -n local-path-storage and drop the output here?
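For reference, this is roughly the sequence I would expect on each node. (A sketch only: I'm assuming the disk is passed to the prereq script with the -d flag; check the usage output of the script in your checkout in case it differs.)
# run the prereq script against the disk reserved for file-system storage
sudo ./prereq-deploy-cortx-cloud.sh -d /dev/sda
# verify the mount the script is expected to create
lsblk /dev/sda
df -h /mnt/fs-local-volume
# check the dynamic provisioner that uses that mount
kubectl get pods -n local-path-storage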
Hi Rick,
Thanks so much for such a detailed reply!
/dev/sda was not in use; the main disk was /dev/sdq on all nodes, but thanks for taking a look! Here is the disk layout:
[root@node-1 k8_cortx_cloud]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.8T 0 disk /mnt/fs-local-volume
sdb 8:16 0 1.8T 0 disk
sdc 8:32 0 1.8T 0 disk
sdd 8:48 0 1.8T 0 disk
sde 8:64 0 1.8T 0 disk
sdf 8:80 0 1.8T 0 disk
sdg 8:96 0 1.8T 0 disk
sdh 8:112 0 1.8T 0 disk
sdi 8:128 0 1.8T 0 disk
sdj 8:144 0 1.8T 0 disk
sdk 8:160 0 1.8T 0 disk
sdl 8:176 0 1.8T 0 disk
sdm 8:192 0 1.8T 0 disk
sdn 8:208 0 1.8T 0 disk
sdo 8:224 0 1.8T 0 disk
sdp 8:240 0 1.8T 0 disk
sdq 65:0 0 372.6G 0 disk
└─sdq1 65:1 0 372.6G 0 part /
Here is the output of kubectl get pods -n local-path-storage:
[root@node-1 k8_cortx_cloud]# kubectl get pods -n local-path-storage
NAME READY STATUS RESTARTS AGE
local-path-provisioner-756898894-jgxf5 0/1 CrashLoopBackOff 5 (2m37s ago) 8m35s
Below is the error message from the deployment:
# Deploy CORTX Local Block Storage
######################################################
NAME: cortx-data-blk-data-default
LAST DEPLOYED: Wed May 18 04:47:53 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
########################################################
# Generating CORTX Pod Machine IDs
########################################################
######################################################
# Deploy CORTX
######################################################
W0518 04:47:56.484842 39300 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0518 04:47:56.540534 39300 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
Error: INSTALLATION FAILED: timed out waiting for the condition
Here is a description of the pod:
[root@node-1 k8_cortx_cloud]# kubectl describe pod -n local-path-storage
Name: local-path-provisioner-756898894-jgxf5
Namespace: local-path-storage
...
Status: Running
IP: 10.52.247.1
IPs:
IP: 10.52.247.1
Controlled By: ReplicaSet/local-path-provisioner-756898894
Containers:
local-path-provisioner:
Container ID: cri-o://0358fbd6b777b6a25bd2b58dc7bc4b91b453857a1e1d34ddbb455d1e42f35e50
Image: ghcr.io/seagate/local-path-provisioner:v0.0.20
Image ID: ghcr.io/seagate/local-path-provisioner@sha256:999cd4cf1d5dbe27d331413fd0c313d30ace67607c962d1c94da03d1b568da95
Port: <none>
Host Port: <none>
Command:
local-path-provisioner
--debug
start
--config
/etc/config/config.json
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 18 May 2022 04:56:42 +0000
Finished: Wed, 18 May 2022 04:57:12 +0000
Ready: False
Restart Count: 6
Environment:
POD_NAMESPACE: local-path-storage (v1:metadata.namespace)
Mounts:
/etc/config/ from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-59jk8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: local-path-config
Optional: false
kube-api-access-59jk8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12m default-scheduler Successfully assigned local-path-storage/local-path-provisioner-756898894-jgxf5 to node-2
Normal Pulling 12m kubelet Pulling image "ghcr.io/seagate/local-path-provisioner:v0.0.20"
Normal Pulled 12m kubelet Successfully pulled image "ghcr.io/seagate/local-path-provisioner:v0.0.20" in 2.096651036s
Normal Pulled 9m17s (x4 over 12m) kubelet Container image "ghcr.io/seagate/local-path-provisioner:v0.0.20" already present on machine
Normal Created 9m16s (x5 over 12m) kubelet Created container local-path-provisioner
Normal Started 9m16s (x5 over 12m) kubelet Started container local-path-provisioner
Warning BackOff 2m39s (x30 over 11m) kubelet Back-off restarting failed container
If there is anything I could do, please let me know! Thanks for helping out again!
That's great; it sounds like your disk layout should be fine with the prereq script.
Can you either paste or upload in an attachment the logs from the crash-looping local path pod? That will help us determine why it's crashlooping. The fix could be related to that specific Pod's container image and cri-o as the runtime; or something completely unrelated.
You can get them via the following command:
kubectl logs -n local-path-storage local-path-provisioner-756898894-jgxf5
You can optionally add the -f flag to follow the log until the container terminates again.
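Since the pod is crash-looping, two standard kubectl variants of that command may also help (nothing CORTX-specific here):
# follow the current container's output until it exits again
kubectl logs -n local-path-storage local-path-provisioner-756898894-jgxf5 -f
# show the logs from the previous (crashed) container instance
kubectl logs -n local-path-storage local-path-provisioner-756898894-jgxf5 --previous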
Hi Rick,
Thank you for providing the command to collect the logs for the local-path-provisioner pod! Running the command, I got:
[root@node-1 cc]# kubectl logs -n local-path-storage local-path-provisioner-756898894-jgxf5
time="2022-05-18T15:04:14Z" level=fatal msg="Error starting daemon: Cannot start Provisioner: failed to get Kubernetes server version: Get \"https://10.96.0.1:443/version?timeout=32s\": dial tcp 10.96.0.1:443: i/o timeout"
Here are all the pods:
[root@node-1 cc]# kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default cortx-consul-client-pjvv5 0/1 Running 0 10h 10.52.84.129 node-1 <none> <none>
default cortx-consul-client-rshp8 0/1 Running 0 10h 10.52.247.2 node-2 <none> <none>
default cortx-consul-server-0 0/1 Pending 0 10h <none> <none> <none> <none>
default cortx-consul-server-1 0/1 Pending 0 10h <none> <none> <none> <none>
default cortx-kafka-0 0/1 Pending 0 10h <none> <none> <none> <none>
default cortx-kafka-1 0/1 Pending 0 10h <none> <none> <none> <none>
default cortx-zookeeper-0 0/1 Pending 0 10h <none> <none> <none> <none>
default cortx-zookeeper-1 0/1 Pending 0 10h <none> <none> <none> <none>
kube-system calico-kube-controllers-6b77fff45-9bw2g 1/1 Running 0 10h 10.85.0.4 node-1 <none> <none>
kube-system calico-node-fsn4h 1/1 Running 0 10h 10.52.3.7 node-2 <none> <none>
kube-system calico-node-xxc59 1/1 Running 0 10h 10.52.2.136 node-1 <none> <none>
kube-system coredns-64897985d-25zzk 1/1 Running 0 10h 10.85.0.2 node-1 <none> <none>
kube-system coredns-64897985d-6g586 1/1 Running 0 10h 10.85.0.3 node-1 <none> <none>
kube-system etcd-node-1 1/1 Running 0 10h 10.52.2.136 node-1 <none> <none>
kube-system kube-apiserver-node-1 1/1 Running 0 10h 10.52.2.136 node-1 <none> <none>
kube-system kube-controller-manager-node-1 1/1 Running 0 10h 10.52.2.136 node-1 <none> <none>
kube-system kube-proxy-86ldt 1/1 Running 0 10h 10.52.2.136 node-1 <none> <none>
kube-system kube-proxy-jr8ww 1/1 Running 0 10h 10.52.3.7 node-2 <none> <none>
kube-system kube-scheduler-node-1 1/1 Running 0 10h 10.52.2.136 node-1 <none> <none>
local-path-storage local-path-provisioner-756898894-jgxf5 0/1 CrashLoopBackOff 113 (3m53s ago) 10h 10.52.247.1 node-2 <none> <none>
And here is a description of the local-path pod:
[root@node-1 cc]# kubectl describe pod -n local-path-storage local-path-provisioner-756898894-jgxf5
Name: local-path-provisioner-756898894-jgxf5
Namespace: local-path-storage
Priority: 0
Node: node-2/10.52.3.7
Start Time: Wed, 18 May 2022 04:47:53 +0000
Labels: app=local-path-provisioner
pod-template-hash=756898894
Annotations: cni.projectcalico.org/containerID: e7ed79a372ffa18fc211a61edf70ece02311d6adda1f914818bb4701179686eb
cni.projectcalico.org/podIP: 10.52.247.1/32
cni.projectcalico.org/podIPs: 10.52.247.1/32
Status: Running
IP: 10.52.247.1
IPs:
IP: 10.52.247.1
Controlled By: ReplicaSet/local-path-provisioner-756898894
Containers:
local-path-provisioner:
Container ID: cri-o://6035602f2d694ca483e3b81f0e80fa58313e719ad49fea04a730d694e05b6fa9
Image: ghcr.io/seagate/local-path-provisioner:v0.0.20
Image ID: ghcr.io/seagate/local-path-provisioner@sha256:999cd4cf1d5dbe27d331413fd0c313d30ace67607c962d1c94da03d1b568da95
Port: <none>
Host Port: <none>
Command:
local-path-provisioner
--debug
start
--config
/etc/config/config.json
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 18 May 2022 15:03:44 +0000
Finished: Wed, 18 May 2022 15:04:14 +0000
Ready: False
Restart Count: 114
Environment:
POD_NAMESPACE: local-path-storage (v1:metadata.namespace)
Mounts:
/etc/config/ from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-59jk8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: local-path-config
Optional: false
kube-api-access-59jk8:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 6m22s (x113 over 10h) kubelet Container image "ghcr.io/seagate/local-path-provisioner:v0.0.20" already present on machine
Warning BackOff 88s (x2564 over 10h) kubelet Back-off restarting failed container
If there is anything I could do, please let me know!
Based on that error message in the local-path-provisioner pod logs, as well as the fact that none of the Consul client pods are actually running (they are all 0/1), this points me in the direction of a networking issue with your cluster as a whole, which means we now have to debug Calico.
Can you check which version of Calico you deployed? Your tutorial docs install it via:
curl https://docs.projectcalico.org/manifests/calico.yaml -O
kubectl apply -f calico.yaml
You can use the following command to figure that out: (there are less verbose ways of doing this, but this does it all in one line through kubectl)
kubectl get pods -n kube-system -l k8s-app=calico-kube-controllers -o jsonpath="{.items[0].spec.containers[0].image}"
We've seen recent issues with Calico v3.22.x and have moved the majority of our deployments to v3.23.
If you are on v3.23, then the next issue is probably the default configuration of Calico in your environment.
Similar to how the CORTX (reference) prereq scripts assume certain disk layouts, the Calico default configs assume certain network interface layouts, so we may need to configure the Calico deployment to select a network interface on each node such that they can all talk to one another. You can reference https://projectcalico.docs.tigera.io/networking/ip-autodetection for the specific directions under the Manifest tab (since that is the installation method you are using).
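As a rough sketch of what that looks like with the Manifest method, the autodetection method is set as an environment variable on the calico-node container (the interface pattern below is only an example; substitute whatever interface your nodes actually share):
# in calico.yaml, under the calico-node container's env list:
#   - name: IP_AUTODETECTION_METHOD
#     value: "interface=ens.*"
# or, on a running cluster:
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD='interface=ens.*'
kubectl rollout status daemonset/calico-node -n kube-system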
If this is the case, you'll want to grab the logs from the calico-node-xxyy pods in the kube-system namespace, and we can look for errors showing which network interfaces the pods are selecting for communication and whether there's a mismatch preventing them from talking to each other.
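For example (pod names will be whatever your cluster shows; the k8s-app=calico-node label comes from the standard Calico manifest):
# list the calico-node pods and which node each is on
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
# dump the logs from each one
kubectl logs -n kube-system <calico-node-pod-name> -c calico-node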
Hi Rick,
Thanks for kindly diagnosing the problem again!
The version of Calico seemed to be v3.23.1:
[root@node-1 cc]# kubectl get pods -n kube-system -l k8s-app=calico-kube-controllers -o jsonpath="{.items[0].spec.containers[0].image}"
docker.io/calico/kube-controllers:v3.23.1
Here are the logs for the Calico controller pod and the Calico node pod for two nodes.
I tried looking through the logs but didn't find many error messages. Maybe there were IP-assignment problems that I didn't catch!
Lastly, thank you for providing the Calico IP configuration documentation. I remember successfully deploying CORTX with Docker while using the default Calico configuration. Maybe I can try manual configuration next time with CRI-O? I will keep reading the documentation!
Thanks again for the assistance!
Best, Faradawn
There were underlying issues here caused by the initial creation of the cluster: the --pod-network-cidr provided overlapped with the physical network the Kubernetes nodes were on, which causes problems because Calico cannot distinguish Pod traffic from Node traffic for IPIP routing. Following the expected default values as documented by Calico at https://projectcalico.docs.tigera.io/getting-started/kubernetes/quickstart#create-a-single-host-kubernetes-cluster should produce the proper results (i.e. running kubeadm init --pod-network-cidr=192.168.0.0/16).
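For anyone hitting the same thing, a minimal sketch of the recreate sequence implied here (destructive; it tears down the existing cluster state on the node):
# reset the existing control plane (destructive)
sudo kubeadm reset
# re-initialize with a pod CIDR that does not overlap the node network
sudo kubeadm init --pod-network-cidr=192.168.0.0/16
# re-apply the Calico manifest afterwards
kubectl apply -f calico.yaml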
Closing this issue as resolved.
Confirmed! Thank you Rick!
Deployment of CORTX 0.6.0 on Kubernetes v1.23.0 with CRI-O v1.23 was successful!
Best, Faradawn
Problem
Deployed Kubernetes v1.23 with CRI-O as the container runtime, but when deploying the latest CORTX, all the pods failed to be scheduled due to a "VolumeBinding" timeout error. Searched for a while but still could not resolve it. Any suggestion would be appreciated!
Expected behavior
The pods should be scheduled successfully.
How to reproduce
CentOS 7, Kubernetes and CRI-O both v1.23, CORTX v0.5.0 (latest).
The complete installation guide is here.
CORTX on Kubernetes version
v0.5.0
Deployment information
Kubernetes version: 1.23
kubectl version: 1.23
Solution configuration file YAML
Logs
logs-cortx.zip
Additional information
None of the pods are ready.
Checking the log for the Consul pod, the error seemed to be a VolumeBinding failure causing the scheduling to fail.
The same error appeared in the Kafka pod and others.
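Some generic commands that surface the same symptom from the Kubernetes side (nothing CORTX-specific; pod and PVC names will differ per deployment):
# pending pods and the scheduler events explaining why
kubectl get pods -A -o wide
kubectl describe pod <pending-pod-name>
# PVCs waiting on the local-path provisioner
kubectl get pvc -A
kubectl describe pvc <pending-pvc-name>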