faradawn closed this issue 2 years ago.
For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-32042. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.
Can you drop the output of kc get pods --all-namespaces -o wide in here? I have a feeling that it's still an underlying networking issue for some reason. As you can see, there is also still one consul-client Pod that is in 0/1 state and not completely running yet. I would imagine the Pods on the untainted master node are fine, but the Pods on the worker node are causing an issue (since those Pods need to route through the default/kubernetes service for DNS, and that resolves to the master node).
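One way to test that theory (a sketch, not from the thread; the debug image and node name are assumptions) is to run a throwaway pod pinned to the worker node and try resolving a cluster-internal name from there:

```shell
# Launch a temporary pod on the worker node (node-2 here; substitute yours)
# and attempt in-cluster DNS resolution. jessie-dnsutils is a commonly used
# upstream test image that ships nslookup/dig.
kubectl run dnsutils --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --overrides='{"spec":{"nodeName":"node-2"}}' \
  -- nslookup kubernetes.default.svc.cluster.local

# For comparison, check which endpoint backs the default/kubernetes service
# (it should resolve to the master node's API server address).
kubectl get endpoints kubernetes
```

If the lookup times out only when the pod lands on the worker node, the worker-to-master pod network (or the kube-proxy path to the DNS service) is the likely culprit.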
Hi Rick,
Here is the output of kc get pods --all-namespaces -o wide
[root@node-1 cc]# kc get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-apiserver calico-apiserver-68444c48d5-9f7hl 1/1 Running 0 3d13h 192.168.84.129 node-1 <none> <none>
calico-apiserver calico-apiserver-68444c48d5-nhrbb 1/1 Running 0 3d13h 192.168.247.1 node-2 <none> <none>
calico-system calico-kube-controllers-69cfd64db4-gvswf 1/1 Running 0 3d13h 10.85.0.4 node-1 <none> <none>
calico-system calico-node-dfqfj 1/1 Running 0 3d13h 10.52.3.226 node-1 <none> <none>
calico-system calico-node-llqxx 1/1 Running 0 3d13h 10.52.2.98 node-2 <none> <none>
calico-system calico-typha-7c59c5d99c-flv5m 1/1 Running 0 3d13h 10.52.3.226 node-1 <none> <none>
default cortx-consul-client-7r776 1/1 Running 0 3d13h 192.168.84.130 node-1 <none> <none>
default cortx-consul-client-x6j8w 0/1 Running 0 3d13h 192.168.247.3 node-2 <none> <none>
default cortx-consul-server-0 1/1 Running 0 3d13h 192.168.247.7 node-2 <none> <none>
default cortx-consul-server-1 1/1 Running 0 3d13h 192.168.84.133 node-1 <none> <none>
default cortx-kafka-0 0/1 CrashLoopBackOff 1796 (4m37s ago) 3d13h 192.168.247.8 node-2 <none> <none>
default cortx-kafka-1 0/1 CrashLoopBackOff 1056 (2m31s ago) 3d13h 192.168.84.136 node-1 <none> <none>
default cortx-zookeeper-0 1/1 Running 0 3d13h 192.168.247.9 node-2 <none> <none>
default cortx-zookeeper-1 1/1 Running 0 3d13h 192.168.84.135 node-1 <none> <none>
kube-system coredns-64897985d-9qn5m 1/1 Running 0 3d13h 10.85.0.3 node-1 <none> <none>
kube-system coredns-64897985d-z8t5b 1/1 Running 0 3d13h 10.85.0.2 node-1 <none> <none>
kube-system etcd-node-1 1/1 Running 0 3d13h 10.52.3.226 node-1 <none> <none>
kube-system kube-apiserver-node-1 1/1 Running 0 3d13h 10.52.3.226 node-1 <none> <none>
kube-system kube-controller-manager-node-1 1/1 Running 0 3d13h 10.52.3.226 node-1 <none> <none>
kube-system kube-proxy-7hpgl 1/1 Running 0 3d13h 10.52.3.226 node-1 <none> <none>
kube-system kube-proxy-m4fcz 1/1 Running 0 3d13h 10.52.2.98 node-2 <none> <none>
kube-system kube-scheduler-node-1 1/1 Running 0 3d13h 10.52.3.226 node-1 <none> <none>
local-path-storage local-path-provisioner-756898894-bgxgk 1/1 Running 0 3d13h 192.168.247.2 node-2 <none> <none>
tigera-operator tigera-operator-7d8c9d4f67-69rlk 1/1 Running 0 3d13h 10.52.3.226 node-1 <none> <none>
Thanks for pointing out that the pods on the master node, node-1, are fine, but those on the worker node, node-2, may have a DNS routing problem! I will read more on DNS-routing issues.
Followed this procedure:
kubeadm init --pod-network-cidr=192.168.0.0/16
join worker node
export KUBECONFIG=/etc/kubernetes/admin.conf
curl https://docs.projectcalico.org/manifests/calico.yaml -O
kubectl apply -f calico.yaml
reboot
ran prereq and deploy cortx
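Before running the CORTX prereq/deploy scripts, it can help to verify the CNI actually came up healthy after the reboot (a sketch; the namespace and labels depend on how Calico was installed):

```shell
# Both nodes should report Ready, and every Calico and CoreDNS pod should be
# Running and fully ready before starting the CORTX deployment.
kubectl get nodes
kubectl get pods -n kube-system -l k8s-app=calico-node   # manifest-based install
kubectl get pods -n calico-system                        # operator-based install
kubectl get pods -n kube-system -l k8s-app=kube-dns
```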
In the second trial, the master node was rebooted after installing Calico, whereas in the first trial CORTX was deployed immediately. Finally, I ran the CORTX deployment script, and it seemed to advance a little further this time:
[root@node-3 k8_cortx_cloud]# ./deploy-cortx-cloud.sh solution.example.yaml
Number of worker nodes detected: 2
Deployment type: standard
"hashicorp" has been added to your repositories
"bitnami" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading consul from repo https://helm.releases.hashicorp.com
Downloading kafka from repo https://charts.bitnami.com/bitnami
Deleting outdated charts
Install Rancher Local Path Provisioner
namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
######################################################
# Deploy CORTX Local Block Storage
######################################################
NAME: cortx-data-blk-data-default
LAST DEPLOYED: Tue Jun 7 18:27:11 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
########################################################
# Generating CORTX Pod Machine IDs
########################################################
######################################################
# Deploy CORTX
######################################################
W0607 18:27:14.732222 20115 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0607 18:27:14.798805 20115 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: cortx
LAST DEPLOYED: Tue Jun 7 18:27:13 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Thanks for installing CORTX Community Object Storage!
serviceaccount/cortx-consul-client patched
serviceaccount/cortx-consul-server patched
statefulset.apps/cortx-consul-server restarted
daemonset.apps/cortx-consul-client restarted
Wait for CORTX 3rd party to be ready..........................................................................................................................................
.........
########################################################
# Deploy CORTX Secrets
########################################################
Generated secret for kafka_admin_secret
Generated secret for consul_admin_secret
Generated secret for common_admin_secret
Generated secret for s3_auth_admin_secret
Generated secret for csm_auth_admin_secret
Generated secret for csm_mgmt_admin_secret
secret/cortx-secret created
########################################################
# Deploy CORTX Control
########################################################
NAME: cortx-control-default
LAST DEPLOYED: Tue Jun 7 18:31:17 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Wait for CORTX Control to be ready............................................................................................................................................
...............................................................................................................................................................error: timed out waiting for the condition on deployments/cortx-control
..............................................................................................................................................................................
..........................................................................................................deployment.apps/cortx-control condition met
Deployment CORTX Control available after 580 seconds
########################################################
# Deploy CORTX Data
########################################################
NAME: cortx-data-node-3-default
LAST DEPLOYED: Tue Jun 7 18:40:58 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NAME: cortx-data-node-4-default
LAST DEPLOYED: Tue Jun 7 18:40:59 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Wait for CORTX Data to be ready...............................................................................................................................................
..............................................................................................................................................................................
.......................................................................................deployment.apps/cortx-data-node-4 condition met
error: timed out waiting for the condition on deployments/cortx-data-node-3
deployment.apps/cortx-data-node-3 condition met
deployment.apps/cortx-data-node-4 condition met
Deployment CORTX Data available after 405 seconds
########################################################
# Deploy CORTX Server
########################################################
NAME: cortx-server-node-3-default
LAST DEPLOYED: Tue Jun 7 18:47:45 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NAME: cortx-server-node-4-default
LAST DEPLOYED: Tue Jun 7 18:47:46 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Wait for CORTX Server to be ready.............................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................timed out waiting for the condition on deployments/cortx-server-node-3
timed out waiting for the condition on deployments/cortx-server-node-4
..............................................................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................................................................................
.............................................................................timed out waiting for the condition on deployments/cortx-server-node-3
timed out waiting for the condition on deployments/cortx-server-node-4
Deployment CORTX Server timed out after 1200 seconds
Failed. Exiting script.
Then, ran get all pods:
[root@node-3 k8_cortx_cloud]# kc get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default cortx-consul-client-6vgrk 1/1 Running 0 38m
default cortx-consul-client-p59zx 1/1 Running 0 39m
default cortx-consul-server-0 1/1 Running 0 38m
default cortx-consul-server-1 1/1 Running 0 39m
default cortx-control-949fffb55-5gst7 1/1 Running 0 36m
default cortx-data-node-3-54c4fc99f4-ctwv5 3/3 Running 0 27m
default cortx-data-node-4-5cdd5c87c7-tnz5h 3/3 Running 0 27m
default cortx-kafka-0 1/1 Running 1 (40m ago) 40m
default cortx-kafka-1 1/1 Running 0 40m
default cortx-server-node-3-56fb787dc5-h9k7j 0/2 Init:0/1 0 20m
default cortx-server-node-4-6c8b5b9cb5-w4877 0/2 Init:0/1 0 20m
default cortx-zookeeper-0 1/1 Running 0 40m
default cortx-zookeeper-1 1/1 Running 0 40m
kube-system calico-kube-controllers-6b77fff45-bzr65 1/1 Running 1 160m
kube-system calico-node-nx6gd 1/1 Running 1 159m
kube-system calico-node-trgvl 1/1 Running 1 160m
kube-system coredns-64897985d-dkq6g 1/1 Running 1 164m
kube-system coredns-64897985d-fpnk4 1/1 Running 1 164m
kube-system etcd-node-3 1/1 Running 1 164m
kube-system kube-apiserver-node-3 1/1 Running 1 164m
kube-system kube-controller-manager-node-3 1/1 Running 1 164m
kube-system kube-proxy-gx7sl 1/1 Running 1 164m
kube-system kube-proxy-j6x2k 1/1 Running 2 159m
kube-system kube-scheduler-node-3 1/1 Running 1 164m
local-path-storage local-path-provisioner-756898894-fs44d 1/1 Running 0 40m
This time I followed the following procedure. The deployment seemed to be stuck on the data pods:
-bash-4.2# kc get nodes
NAME STATUS ROLES AGE VERSION
node-5 Ready control-plane,master 46m v1.23.0
node-6 Ready <none> 21m v1.23.0
-bash-4.2# ./deploy-cortx-cloud.sh solution.example.yaml
Number of worker nodes detected: 2
Deployment type: standard
"hashicorp" has been added to your repositories
"bitnami" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading consul from repo https://helm.releases.hashicorp.com
Downloading kafka from repo https://charts.bitnami.com/bitnami
Deleting outdated charts
Install Rancher Local Path Provisioner
namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
######################################################
# Deploy CORTX Local Block Storage
######################################################
NAME: cortx-data-blk-data-default
LAST DEPLOYED: Tue Jun 7 19:38:55 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
########################################################
# Generating CORTX Pod Machine IDs
########################################################
######################################################
# Deploy CORTX
######################################################
W0607 19:38:58.280036 7673 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0607 19:38:58.342339 7673 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: cortx
LAST DEPLOYED: Tue Jun 7 19:38:57 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Thanks for installing CORTX Community Object Storage!
serviceaccount/cortx-consul-client patched
serviceaccount/cortx-consul-server patched
statefulset.apps/cortx-consul-server restarted
daemonset.apps/cortx-consul-client restarted
Wait for CORTX 3rd party to be ready..........................................................................................................................................
..............................................................
########################################################
# Deploy CORTX Secrets
########################################################
Generated secret for kafka_admin_secret
Generated secret for consul_admin_secret
Generated secret for common_admin_secret
Generated secret for s3_auth_admin_secret
Generated secret for csm_auth_admin_secret
Generated secret for csm_mgmt_admin_secret
secret/cortx-secret created
########################################################
# Deploy CORTX Control
########################################################
NAME: cortx-control-default
LAST DEPLOYED: Tue Jun 7 19:44:15 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Wait for CORTX Control to be ready............................................................................................................................................
...............................................................................................................................................................error: timed out waiting for the condition on deployments/cortx-control
..............................................................................................................................................................................
....................................................................................deployment.apps/cortx-control condition met
Deployment CORTX Control available after 558 seconds
########################################################
# Deploy CORTX Data
########################################################
NAME: cortx-data-node-5-default
LAST DEPLOYED: Tue Jun 7 19:53:34 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NAME: cortx-data-node-6-default
LAST DEPLOYED: Tue Jun 7 19:53:34 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Wait for CORTX Data to be ready...............................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................................................................................
............................................................................................................timed out waiting for the condition on deployments/cortx-data-node-5
timed out waiting for the condition on deployments/cortx-data-node-6
..............................................................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................................................................................
.............................................................................timed out waiting for the condition on deployments/cortx-data-node-5
timed out waiting for the condition on deployments/cortx-data-node-6
Deployment CORTX Data timed out after 1200 seconds
Failed. Exiting script.
Here is get all pods:
-bash-4.2# kc get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default cortx-consul-client-m6w7q 0/1 Running 0 68m 192.168.150.76 node-5 <none> <none>
default cortx-consul-client-zgdcm 1/1 Running 0 68m 192.168.49.201 node-6 <none> <none>
default cortx-consul-server-0 0/1 Running 1 (67m ago) 67m 192.168.49.202 node-6 <none> <none>
default cortx-consul-server-1 0/1 Running 0 68m 192.168.150.75 node-5 <none> <none>
default cortx-control-5b5d458c47-7vh4z 1/1 Running 0 65m 192.168.49.204 node-6 <none> <none>
default cortx-data-node-5-84f68f846-9gcd9 3/3 Running 0 55m 192.168.150.78 node-5 <none> <none>
default cortx-data-node-6-78557bbd5-l7sql 3/3 Running 0 55m 192.168.49.206 node-6 <none> <none>
default cortx-kafka-0 1/1 Running 2 (69m ago) 70m 192.168.49.198 node-6 <none> <none>
default cortx-kafka-1 1/1 Running 2 (69m ago) 70m 192.168.150.70 node-5 <none> <none>
default cortx-zookeeper-0 1/1 Running 0 70m 192.168.49.200 node-6 <none> <none>
default cortx-zookeeper-1 1/1 Running 0 70m 192.168.150.74 node-5 <none> <none>
kube-system calico-kube-controllers-6b77fff45-fqh79 1/1 Running 1 111m 192.168.150.67 node-5 <none> <none>
kube-system calico-node-4w6nf 1/1 Running 1 111m 10.52.3.120 node-5 <none> <none>
kube-system calico-node-wjmsh 1/1 Running 0 92m 10.52.3.25 node-6 <none> <none>
kube-system coredns-64897985d-mlt6h 1/1 Running 1 117m 192.168.150.65 node-5 <none> <none>
kube-system coredns-64897985d-mpnlf 1/1 Running 1 117m 192.168.150.66 node-5 <none> <none>
kube-system etcd-node-5 1/1 Running 1 117m 10.52.3.120 node-5 <none> <none>
kube-system kube-apiserver-node-5 1/1 Running 1 117m 10.52.3.120 node-5 <none> <none>
kube-system kube-controller-manager-node-5 1/1 Running 1 117m 10.52.3.120 node-5 <none> <none>
kube-system kube-proxy-tgnwt 1/1 Running 1 117m 10.52.3.120 node-5 <none> <none>
kube-system kube-proxy-vqv2s 1/1 Running 0 92m 10.52.3.25 node-6 <none> <none>
kube-system kube-scheduler-node-5 1/1 Running 1 117m 10.52.3.120 node-5 <none> <none>
local-path-storage local-path-provisioner-756898894-nhbcz 1/1 Running 0 70m 192.168.49.193 node-6 <none> <none>
I wasn't sure whether rebooting after installing Calico helped; some problems still persisted. I can keep trying, and I appreciate any suggestions!
Thanks in advance!
Checking out your updated deployment flow, the issue came into play because of the updated Calico deployment steps you were using. When you apply https://projectcalico.docs.tigera.io/manifests/custom-resources.yaml, there's a setting in it that is incorrect for most systems and isn't the standard default applied in most Calico environments.
The encapsulation setting in that linked document from the Calico docs is VXLANCrossSubnet, while the default (and most commonly used) setting is IPIP. To rectify this, I deleted the Installation resource (named default) that was created and created a new one with encapsulation: IPIP instead, then restarted CoreDNS with kubectl rollout restart -n kube-system deployment/coredns.
Instead of using most of the Calico defaults and applying them directly from the website, it would be good to use those as templates, then tailor them to your specific environment and reference the tailored copies in your scripted installs. If you don't have a good place to host those files, you can use https://gist.github.com/ and reference them directly in your tutorial documents.
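As a sketch of that remediation (assuming the operator-based install, so the object involved is the Installation resource named default; custom-resources-ipip.yaml is a hypothetical local copy of the manifest with encapsulation changed to IPIP):

```shell
# Check what encapsulation the current Installation uses.
kubectl get installation default -o yaml

# The ipPools section cannot be changed in place, so delete the Installation
# and re-apply a tailored copy that sets encapsulation: IPIP.
kubectl delete installation default
kubectl apply -f custom-resources-ipip.yaml

# Restart CoreDNS so it comes back up on the reworked pod network.
kubectl rollout restart -n kube-system deployment/coredns
```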
Reference:
# This section includes base Calico installation configuration.
# For more information, see: https://projectcalico.docs.tigera.io/v3.23/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
name: default
spec:
# Configures Calico networking.
calicoNetwork:
# Note: The ipPools section cannot be modified post-install.
ipPools:
- blockSize: 26
cidr: 192.168.0.0/16
encapsulation: IPIP
natOutgoing: Enabled
nodeSelector: all()
---
# This section configures the Calico API server.
# For more information, see: https://projectcalico.docs.tigera.io/v3.23/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
name: default
spec: {}
Problem
During the deployment of CORTX, Kafka pods kept crashing and looping.
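For reference, a quick way to inspect why the Kafka pods were crash-looping (a sketch; the pod name comes from the listings above):

```shell
# Events at the bottom of describe often show why the container keeps restarting.
kubectl describe pod cortx-kafka-0

# --previous prints the log of the last crashed container instance, which for
# Kafka usually names the failing broker or ZooKeeper connection.
kubectl logs cortx-kafka-0 --previous
```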
Expected behavior
CORTX should deploy successfully, as Rick once did with CRI-O.
How to reproduce
Run the following script directly on CentOS 7.
The link to the deployment script is here.
Thanks again for taking a look!
CORTX on Kubernetes version
v0.6.0
Deployment information
Kubernetes version: v1.23.0 kubectl version: v1.23.0
Solution configuration file YAML
Logs
All pods
Kafka pod log
Additional information
solution.example.yaml.txt