Seagate / cortx-k8s

CORTX Kubernetes Orchestration Repository
https://github.com/Seagate/cortx
Apache License 2.0

Kafka pod failed when deploying CORTX on CRI-O #274

Closed faradawn closed 2 years ago

faradawn commented 2 years ago

Problem

During the deployment of CORTX, the Kafka pods kept crash-looping.

Expected behavior

CORTX should deploy successfully; Rick managed this once with CRI-O.

How to reproduce

The following script can be run directly on CentOS 7:

source <(curl -s https://raw.githubusercontent.com/faradawn/tutorials/main/linux/cortx/kube.sh)

The deployment script itself is linked here.

Thanks again for taking a look!

CORTX on Kubernetes version

v0.6.0

Deployment information

Kubernetes version: v1.23.0
kubectl version: v1.23.0

Solution configuration file YAML

The file is attached below; here is a summary:
- the cluster has only node-1 and node-2
- the master node is node-1, which is untainted (see the untaint sketch after this list)
- storage uses only sdc, sdd, and sde
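
For reference, removing the control-plane taint on Kubernetes v1.23 typically looks like this (standard kubectl, shown only for context; the node name is taken from the report):

kubectl taint nodes node-1 node-role.kubernetes.io/master-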

Logs

All pods

[root@node-1 cc]# kc get pods --all-namespaces
NAMESPACE            NAME                                       READY   STATUS             RESTARTS        AGE
calico-apiserver     calico-apiserver-68444c48d5-9f7hl          1/1     Running            0               7h38m
calico-apiserver     calico-apiserver-68444c48d5-nhrbb          1/1     Running            0               7h38m
calico-system        calico-kube-controllers-69cfd64db4-gvswf   1/1     Running            0               7h39m
calico-system        calico-node-dfqfj                          1/1     Running            0               7h39m
calico-system        calico-node-llqxx                          1/1     Running            0               7h39m
calico-system        calico-typha-7c59c5d99c-flv5m              1/1     Running            0               7h39m
default              cortx-consul-client-7r776                  1/1     Running            0               7h36m
default              cortx-consul-client-x6j8w                  0/1     Running            0               7h36m
default              cortx-consul-server-0                      1/1     Running            0               7h36m
default              cortx-consul-server-1                      1/1     Running            0               7h36m
default              cortx-kafka-0                              0/1     CrashLoopBackOff   162 (28s ago)   7h36m
default              cortx-kafka-1                              0/1     CrashLoopBackOff   99 (4m3s ago)   7h36m
default              cortx-zookeeper-0                          1/1     Running            0               7h36m
default              cortx-zookeeper-1                          1/1     Running            0               7h36m
kube-system          coredns-64897985d-9qn5m                    1/1     Running            0               7h40m
kube-system          coredns-64897985d-z8t5b                    1/1     Running            0               7h40m
kube-system          etcd-node-1                                1/1     Running            0               7h40m
kube-system          kube-apiserver-node-1                      1/1     Running            0               7h40m
kube-system          kube-controller-manager-node-1             1/1     Running            0               7h40m
kube-system          kube-proxy-7hpgl                           1/1     Running            0               7h40m
kube-system          kube-proxy-m4fcz                           1/1     Running            0               7h39m
kube-system          kube-scheduler-node-1                      1/1     Running            0               7h40m
local-path-storage   local-path-provisioner-756898894-bgxgk     1/1     Running            0               7h36m
tigera-operator      tigera-operator-7d8c9d4f67-69rlk           1/1     Running            0               7h40m

Kafka pod log

[root@node-1 k8_cortx_cloud]# kc logs cortx-kafka-0
kafka 12:13:50.75 
kafka 12:13:50.75 Welcome to the Bitnami kafka container
kafka 12:13:50.76 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-kafka
kafka 12:13:50.76 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-kafka/issues
kafka 12:13:50.76 
kafka 12:13:50.76 INFO  ==> ** Starting Kafka setup **
kafka 12:13:50.82 WARN  ==> You set the environment variable ALLOW_PLAINTEXT_LISTENER=yes. For safety reasons, do not use this flag in a production environment.
kafka 12:13:50.83 INFO  ==> Initializing Kafka...
kafka 12:13:50.84 INFO  ==> No injected configuration files found, creating default config files
kafka 12:13:51.13 INFO  ==> Configuring Kafka for inter-broker communications with PLAINTEXT authentication.
kafka 12:13:51.14 WARN  ==> Inter-broker communications are configured as PLAINTEXT. This is not safe for production environments.
kafka 12:13:51.14 INFO  ==> Configuring Kafka for client communications with PLAINTEXT authentication.
kafka 12:13:51.14 WARN  ==> Client communications are configured using PLAINTEXT listeners. For safety reasons, do not use this in a production environment.
kafka 12:13:51.16 INFO  ==> ** Kafka setup finished! **

kafka 12:13:51.19 INFO  ==> ** Starting Kafka **
[2022-06-04 12:13:52,698] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2022-06-04 12:13:53,334] INFO Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation (org.apache.zookeeper.common.X509Util)
[2022-06-04 12:13:53,524] INFO Registered signal handlers for TERM, INT, HUP (org.apache.kafka.common.utils.LoggingSignalHandler)
[2022-06-04 12:13:53,526] INFO starting (kafka.server.KafkaServer)
[2022-06-04 12:13:53,527] INFO Connecting to zookeeper on cortx-zookeeper (kafka.server.KafkaServer)
[2022-06-04 12:13:53,541] INFO [ZooKeeperClient Kafka server] Initializing a new session to cortx-zookeeper. (kafka.zookeeper.ZooKeeperClient)
[2022-06-04 12:13:53,545] INFO Client environment:zookeeper.version=3.6.3--6401e4ad2087061bc6b9f80dec2d69f2e3c8660a, built on 04/08/2021 16:35 GMT (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,545] INFO Client environment:host.name=cortx-kafka-0.cortx-kafka-headless.default.svc.cluster.local (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,545] INFO Client environment:java.version=11.0.14 (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,545] INFO Client environment:java.vendor=BellSoft (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,545] INFO Client environment:java.home=/opt/bitnami/java (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,545] INFO Client environment:java.class.path=/opt/bitnami/kafka/bin/../libs/activation-1.1.1.jar:/opt/bitnami/kafka/bin/../libs/aopalliance-repackaged-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/argparse4j-0.7.0.jar:/opt/bitnami/kafka/bin/../libs/audience-annotations-0.5.0.jar:/opt/bitnami/kafka/bin/../libs/commons-cli-1.4.jar:/opt/bitnami/kafka/bin/../libs/commons-lang3-3.8.1.jar:/opt/bitnami/kafka/bin/../libs/connect-api-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-basic-auth-extension-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-file-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-json-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-mirror-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-mirror-client-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-runtime-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-transforms-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/hk2-api-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/hk2-locator-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/hk2-utils-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/jackson-annotations-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-core-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-databind-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-dataformat-csv-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-datatype-jdk8-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-jaxrs-base-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-jaxrs-json-provider-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-module-jaxb-annotations-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-module-scala_2.12-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jakarta.activation-api-1.2.1.jar:/opt/bitnami/kafka/bin/../libs/jakarta.annotation-api-1.3.5.jar:/opt/bitnami/kafka/bin/../libs/jakarta.inject-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/jakarta.validation-api-2.0.2.jar:/opt/bitnami/kafka/bin/../libs/jakarta.ws.rs-api-2.1.6.jar:/opt/bitnami/kafka/bin/../libs/jakarta.xml.bind-api-2.3.2.jar:/opt/bitnami/kafka/bin/../libs/javassist-3.27.0-GA.jar:/opt/bitnami/kafka/bin/../libs/javax.servlet-api-3.1.0.jar:/opt/bitnami/kafka/bin/../libs/javax.ws.rs-api-2.1.1.jar:/opt/bitnami/kafka/bin/../libs/jaxb-api-2.3.0.jar:/opt/bitnami/kafka/bin/../libs/jersey-client-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-common-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-container-servlet-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-container-servlet-core-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-hk2-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-server-2.34.jar:/opt/bitnami/kafka/bin/../libs/jetty-client-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-continuation-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-http-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-io-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-security-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-server-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-servlet-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-servlets-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-util-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-util-ajax-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jline-3.12.1.jar:/opt/bitnami/kafka/bin/../libs/jopt-simple-5.0.4.jar:/opt/bitnami/kafka/bin/../libs/kafka-clients-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-log4j-appender-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-metadata-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-raft-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafk
a-server-common-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-shell-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-storage-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-storage-api-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-examples-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-scala_2.12-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-test-utils-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-tools-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka_2.12-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/log4j-1.2.17.jar:/opt/bitnami/kafka/bin/../libs/lz4-java-1.7.1.jar:/opt/bitnami/kafka/bin/../libs/maven-artifact-3.8.1.jar:/opt/bitnami/kafka/bin/../libs/metrics-core-2.2.0.jar:/opt/bitnami/kafka/bin/../libs/metrics-core-4.1.12.1.jar:/opt/bitnami/kafka/bin/../libs/netty-buffer-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-codec-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-common-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-handler-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-resolver-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-transport-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-transport-native-epoll-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-transport-native-unix-common-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/osgi-resource-locator-1.0.3.jar:/opt/bitnami/kafka/bin/../libs/paranamer-2.8.jar:/opt/bitnami/kafka/bin/../libs/plexus-utils-3.2.1.jar:/opt/bitnami/kafka/bin/../libs/reflections-0.9.12.jar:/opt/bitnami/kafka/bin/../libs/rocksdbjni-6.19.3.jar:/opt/bitnami/kafka/bin/../libs/scala-collection-compat_2.12-2.4.4.jar:/opt/bitnami/kafka/bin/../libs/scala-java8-compat_2.12-1.0.0.jar:/opt/bitnami/kafka/bin/../libs/scala-library-2.12.14.jar:/opt/bitnami/kafka/bin/../libs/scala-logging_2.12-3.9.3.jar:/opt/bitnami/kafka/bin/../libs/scala-reflect-2.12.14.jar:/opt/bitnami/kafka/bin/../libs/slf4j-api-1.7.30.jar:/opt/bitnami/kafka/bin/../libs/slf4j-log4j12-1.7.30.jar:/opt/bitnami/kafka/bin/../libs/snappy-java-1.1.8.1.jar:/opt/bitnami/kafka/bin/../libs/trogdor-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/zookeeper-3.6.3.jar:/opt/bitnami/kafka/bin/../libs/zookeeper-jute-3.6.3.jar:/opt/bitnami/kafka/bin/../libs/zstd-jni-1.5.0-2.jar (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:java.io.tmpdir=/tmp (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:java.compiler=<NA> (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:os.name=Linux (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:os.arch=amd64 (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:os.version=3.10.0-1127.19.1.el7.x86_64 (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:user.name=1001 (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:user.home=/ (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:user.dir=/ (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:os.memory.free=1010MB (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:os.memory.max=1024MB (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,546] INFO Client environment:os.memory.total=1024MB (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,548] INFO Initiating client connection, connectString=cortx-zookeeper sessionTimeout=18000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@51972dc7 (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:13:53,552] INFO jute.maxbuffer value is 4194304 Bytes (org.apache.zookeeper.ClientCnxnSocket)
[2022-06-04 12:13:53,557] INFO zookeeper.request.timeout value is 0. feature enabled=false (org.apache.zookeeper.ClientCnxn)
[2022-06-04 12:13:53,559] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2022-06-04 12:13:59,560] INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient)
[2022-06-04 12:14:13,580] ERROR Unable to resolve address: cortx-zookeeper:2181 (org.apache.zookeeper.client.StaticHostProvider)
java.net.UnknownHostException: cortx-zookeeper: Temporary failure in name resolution
    at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)
    at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519)
    at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1509)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1368)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1302)
    at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:88)
    at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:141)
    at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:368)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1207)
[2022-06-04 12:14:13,588] WARN An exception was thrown while closing send thread for session 0x0. (org.apache.zookeeper.ClientCnxn)
java.lang.IllegalArgumentException: Unable to canonicalize address cortx-zookeeper:2181 because it's not resolvable
    at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:78)
    at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
    at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1161)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1210)
[2022-06-04 12:14:13,693] INFO Session: 0x0 closed (org.apache.zookeeper.ZooKeeper)
[2022-06-04 12:14:13,694] INFO EventThread shut down for session: 0x0 (org.apache.zookeeper.ClientCnxn)
[2022-06-04 12:14:13,695] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient)
[2022-06-04 12:14:13,697] ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
    at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:254)
    at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:250)
    at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:108)
    at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1981)
    at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:457)
    at kafka.server.KafkaServer.startup(KafkaServer.scala:196)
    at kafka.Kafka$.main(Kafka.scala:109)
    at kafka.Kafka.main(Kafka.scala)
[2022-06-04 12:14:13,698] INFO shutting down (kafka.server.KafkaServer)
[2022-06-04 12:14:13,703] INFO App info kafka.server for 0 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-06-04 12:14:13,703] INFO shut down completed (kafka.server.KafkaServer)
[2022-06-04 12:14:13,703] ERROR Exiting Kafka. (kafka.Kafka$)
[2022-06-04 12:14:13,704] INFO shutting down (kafka.server.KafkaServer)

Additional information

solution.example.yaml.txt

cortx-admin commented 2 years ago

For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-32042. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.

osowski commented 2 years ago

Can you drop the output of kc get pods --all-namespaces -o wide in here? I have a feeling that it's still an underlying networking issue for some reason. As you can see, there is also still one consul-client Pod in a 0/1 state that is not completely running yet. I would imagine the Pods on the untainted master node are fine, but the Pods on the worker node are causing an issue (since those Pods need to route through the default/kubernetes Service for DNS, and that resolves to the master node).
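
One quick way to test this theory is to resolve the Service from a throwaway Pod pinned to the worker node (a sketch; the image and node name are assumptions, not from the original comment):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"node-2"}}' \
  -- nslookup cortx-zookeeper.default.svc.cluster.local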

faradawn commented 2 years ago

Hi Rick,

Here is the output of kc get pods --all-namespaces -o wide

[root@node-1 cc]# kc get pods --all-namespaces -o wide
NAMESPACE            NAME                                       READY   STATUS             RESTARTS           AGE     IP               NODE     NOMINATED NODE   READINESS GATES
calico-apiserver     calico-apiserver-68444c48d5-9f7hl          1/1     Running            0                  3d13h   192.168.84.129   node-1   <none>           <none>
calico-apiserver     calico-apiserver-68444c48d5-nhrbb          1/1     Running            0                  3d13h   192.168.247.1    node-2   <none>           <none>
calico-system        calico-kube-controllers-69cfd64db4-gvswf   1/1     Running            0                  3d13h   10.85.0.4        node-1   <none>           <none>
calico-system        calico-node-dfqfj                          1/1     Running            0                  3d13h   10.52.3.226      node-1   <none>           <none>
calico-system        calico-node-llqxx                          1/1     Running            0                  3d13h   10.52.2.98       node-2   <none>           <none>
calico-system        calico-typha-7c59c5d99c-flv5m              1/1     Running            0                  3d13h   10.52.3.226      node-1   <none>           <none>
default              cortx-consul-client-7r776                  1/1     Running            0                  3d13h   192.168.84.130   node-1   <none>           <none>
default              cortx-consul-client-x6j8w                  0/1     Running            0                  3d13h   192.168.247.3    node-2   <none>           <none>
default              cortx-consul-server-0                      1/1     Running            0                  3d13h   192.168.247.7    node-2   <none>           <none>
default              cortx-consul-server-1                      1/1     Running            0                  3d13h   192.168.84.133   node-1   <none>           <none>
default              cortx-kafka-0                              0/1     CrashLoopBackOff   1796 (4m37s ago)   3d13h   192.168.247.8    node-2   <none>           <none>
default              cortx-kafka-1                              0/1     CrashLoopBackOff   1056 (2m31s ago)   3d13h   192.168.84.136   node-1   <none>           <none>
default              cortx-zookeeper-0                          1/1     Running            0                  3d13h   192.168.247.9    node-2   <none>           <none>
default              cortx-zookeeper-1                          1/1     Running            0                  3d13h   192.168.84.135   node-1   <none>           <none>
kube-system          coredns-64897985d-9qn5m                    1/1     Running            0                  3d13h   10.85.0.3        node-1   <none>           <none>
kube-system          coredns-64897985d-z8t5b                    1/1     Running            0                  3d13h   10.85.0.2        node-1   <none>           <none>
kube-system          etcd-node-1                                1/1     Running            0                  3d13h   10.52.3.226      node-1   <none>           <none>
kube-system          kube-apiserver-node-1                      1/1     Running            0                  3d13h   10.52.3.226      node-1   <none>           <none>
kube-system          kube-controller-manager-node-1             1/1     Running            0                  3d13h   10.52.3.226      node-1   <none>           <none>
kube-system          kube-proxy-7hpgl                           1/1     Running            0                  3d13h   10.52.3.226      node-1   <none>           <none>
kube-system          kube-proxy-m4fcz                           1/1     Running            0                  3d13h   10.52.2.98       node-2   <none>           <none>
kube-system          kube-scheduler-node-1                      1/1     Running            0                  3d13h   10.52.3.226      node-1   <none>           <none>
local-path-storage   local-path-provisioner-756898894-bgxgk     1/1     Running            0                  3d13h   192.168.247.2    node-2   <none>           <none>
tigera-operator      tigera-operator-7d8c9d4f67-69rlk           1/1     Running            0                  3d13h   10.52.3.226      node-1   <none>           <none>

Thanks for pointing out that the pods on the master node, node-1, are fine, but those on the worker node, node-2, are having a potential DNS routing problem!

I will read more about DNS routing problems!
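
For reference, a generic way to check where cluster DNS actually runs, and whether the worker node can reach it, is standard kubectl (not from the original thread):

kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get endpoints kube-dns
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide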

faradawn commented 2 years ago

Trial Two: node-3 and node-4

I followed this procedure:

kubeadm init --pod-network-cidr=192.168.0.0/16
# join the worker node using the printed `kubeadm join` command
export KUBECONFIG=/etc/kubernetes/admin.conf
curl https://docs.projectcalico.org/manifests/calico.yaml -O
kubectl apply -f calico.yaml
reboot
# then run the prerequisite script and deploy CORTX

In the second trial, the master node was rebooted after installing Calico, whereas in the first trial CORTX was deployed immediately. I then ran the CORTX deployment script, which seemed to advance a little further this time:

[root@node-3 k8_cortx_cloud]# ./deploy-cortx-cloud.sh solution.example.yaml

Number of worker nodes detected: 2
Deployment type: standard

"hashicorp" has been added to your repositories
"bitnami" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading consul from repo https://helm.releases.hashicorp.com
Downloading kafka from repo https://charts.bitnami.com/bitnami
Deleting outdated charts
Install Rancher Local Path Provisioner
namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
######################################################
# Deploy CORTX Local Block Storage                    
######################################################
NAME: cortx-data-blk-data-default
LAST DEPLOYED: Tue Jun  7 18:27:11 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
########################################################
# Generating CORTX Pod Machine IDs                      
########################################################
######################################################
# Deploy CORTX                                        
######################################################
W0607 18:27:14.732222   20115 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0607 18:27:14.798805   20115 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: cortx
LAST DEPLOYED: Tue Jun  7 18:27:13 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Thanks for installing CORTX Community Object Storage!
serviceaccount/cortx-consul-client patched
serviceaccount/cortx-consul-server patched
statefulset.apps/cortx-consul-server restarted
daemonset.apps/cortx-consul-client restarted

Wait for CORTX 3rd party to be ready..........................................................................................................................................
.........

########################################################
# Deploy CORTX Secrets                                  
########################################################
Generated secret for kafka_admin_secret
Generated secret for consul_admin_secret
Generated secret for common_admin_secret
Generated secret for s3_auth_admin_secret
Generated secret for csm_auth_admin_secret
Generated secret for csm_mgmt_admin_secret
secret/cortx-secret created
########################################################
# Deploy CORTX Control                                  
########################################################
NAME: cortx-control-default
LAST DEPLOYED: Tue Jun  7 18:31:17 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait for CORTX Control to be ready............................................................................................................................................
...............................................................................................................................................................error: timed out waiting for the condition on deployments/cortx-control
..............................................................................................................................................................................
..........................................................................................................deployment.apps/cortx-control condition met

Deployment CORTX Control available after 580 seconds

########################################################
# Deploy CORTX Data                                     
########################################################
NAME: cortx-data-node-3-default
LAST DEPLOYED: Tue Jun  7 18:40:58 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NAME: cortx-data-node-4-default
LAST DEPLOYED: Tue Jun  7 18:40:59 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait for CORTX Data to be ready...............................................................................................................................................
..............................................................................................................................................................................
.......................................................................................deployment.apps/cortx-data-node-4 condition met
error: timed out waiting for the condition on deployments/cortx-data-node-3
deployment.apps/cortx-data-node-3 condition met
deployment.apps/cortx-data-node-4 condition met

Deployment CORTX Data available after 405 seconds

########################################################
# Deploy CORTX Server                                   
########################################################
NAME: cortx-server-node-3-default
LAST DEPLOYED: Tue Jun  7 18:47:45 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NAME: cortx-server-node-4-default
LAST DEPLOYED: Tue Jun  7 18:47:46 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait for CORTX Server to be ready.............................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................timed out waiting for the condition on deployments/cortx-server-node-3
timed out waiting for the condition on deployments/cortx-server-node-4
..............................................................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................................................................................
.............................................................................timed out waiting for the condition on deployments/cortx-server-node-3
timed out waiting for the condition on deployments/cortx-server-node-4

Deployment CORTX Server timed out after 1200 seconds

Failed.  Exiting script.

Then I listed all pods:

[root@node-3 k8_cortx_cloud]# kc get pods --all-namespaces
NAMESPACE            NAME                                      READY   STATUS     RESTARTS      AGE
default              cortx-consul-client-6vgrk                 1/1     Running    0             38m
default              cortx-consul-client-p59zx                 1/1     Running    0             39m
default              cortx-consul-server-0                     1/1     Running    0             38m
default              cortx-consul-server-1                     1/1     Running    0             39m
default              cortx-control-949fffb55-5gst7             1/1     Running    0             36m
default              cortx-data-node-3-54c4fc99f4-ctwv5        3/3     Running    0             27m
default              cortx-data-node-4-5cdd5c87c7-tnz5h        3/3     Running    0             27m
default              cortx-kafka-0                             1/1     Running    1 (40m ago)   40m
default              cortx-kafka-1                             1/1     Running    0             40m
default              cortx-server-node-3-56fb787dc5-h9k7j      0/2     Init:0/1   0             20m
default              cortx-server-node-4-6c8b5b9cb5-w4877      0/2     Init:0/1   0             20m
default              cortx-zookeeper-0                         1/1     Running    0             40m
default              cortx-zookeeper-1                         1/1     Running    0             40m
kube-system          calico-kube-controllers-6b77fff45-bzr65   1/1     Running    1             160m
kube-system          calico-node-nx6gd                         1/1     Running    1             159m
kube-system          calico-node-trgvl                         1/1     Running    1             160m
kube-system          coredns-64897985d-dkq6g                   1/1     Running    1             164m
kube-system          coredns-64897985d-fpnk4                   1/1     Running    1             164m
kube-system          etcd-node-3                               1/1     Running    1             164m
kube-system          kube-apiserver-node-3                     1/1     Running    1             164m
kube-system          kube-controller-manager-node-3            1/1     Running    1             164m
kube-system          kube-proxy-gx7sl                          1/1     Running    1             164m
kube-system          kube-proxy-j6x2k                          1/1     Running    2             159m
kube-system          kube-scheduler-node-3                     1/1     Running    1             164m
local-path-storage   local-path-provisioner-756898894-fs44d    1/1     Running    0             40m
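
The two cortx-server Pods above are stuck in Init:0/1. Generic triage for that state looks like this (the pod name is copied from the listing; the init-container name is a placeholder to fill in from the second command):

kubectl describe pod cortx-server-node-3-56fb787dc5-h9k7j | tail -n 30
# list the init containers, then fetch logs from the one that is stuck
kubectl get pod cortx-server-node-3-56fb787dc5-h9k7j \
  -o jsonpath='{.spec.initContainers[*].name}'
kubectl logs cortx-server-node-3-56fb787dc5-h9k7j -c <init-container-name>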

Trial Three: node-5 and node-6

This time, I followed the following procedure:

The deployment seemed to get stuck on the data pods:

-bash-4.2# kc get nodes
NAME     STATUS   ROLES                  AGE   VERSION
node-5   Ready    control-plane,master   46m   v1.23.0
node-6   Ready    <none>                 21m   v1.23.0
-bash-4.2# ./deploy-cortx-cloud.sh solution.example.yaml

Number of worker nodes detected: 2
Deployment type: standard

"hashicorp" has been added to your repositories
"bitnami" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "hashicorp" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 2 charts
Downloading consul from repo https://helm.releases.hashicorp.com
Downloading kafka from repo https://charts.bitnami.com/bitnami
Deleting outdated charts
Install Rancher Local Path Provisioner
namespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
######################################################
# Deploy CORTX Local Block Storage                    
######################################################
NAME: cortx-data-blk-data-default
LAST DEPLOYED: Tue Jun  7 19:38:55 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
########################################################
# Generating CORTX Pod Machine IDs                      
########################################################
######################################################
# Deploy CORTX                                        
######################################################
W0607 19:38:58.280036    7673 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0607 19:38:58.342339    7673 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: cortx
LAST DEPLOYED: Tue Jun  7 19:38:57 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Thanks for installing CORTX Community Object Storage!
serviceaccount/cortx-consul-client patched
serviceaccount/cortx-consul-server patched
statefulset.apps/cortx-consul-server restarted
daemonset.apps/cortx-consul-client restarted

Wait for CORTX 3rd party to be ready..........................................................................................................................................
..............................................................

########################################################
# Deploy CORTX Secrets                                  
########################################################
Generated secret for kafka_admin_secret
Generated secret for consul_admin_secret
Generated secret for common_admin_secret
Generated secret for s3_auth_admin_secret
Generated secret for csm_auth_admin_secret
Generated secret for csm_mgmt_admin_secret
secret/cortx-secret created
########################################################
# Deploy CORTX Control                                  
########################################################
NAME: cortx-control-default
LAST DEPLOYED: Tue Jun  7 19:44:15 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait for CORTX Control to be ready............................................................................................................................................
...............................................................................................................................................................error: timed out waiting for the condition on deployments/cortx-control
..............................................................................................................................................................................
....................................................................................deployment.apps/cortx-control condition met

Deployment CORTX Control available after 558 seconds

########################################################
# Deploy CORTX Data                                     
########################################################
NAME: cortx-data-node-5-default
LAST DEPLOYED: Tue Jun  7 19:53:34 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NAME: cortx-data-node-6-default
LAST DEPLOYED: Tue Jun  7 19:53:34 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait for CORTX Data to be ready...............................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................................................................................
............................................................................................................timed out waiting for the condition on deployments/cortx-data-node-5
timed out waiting for the condition on deployments/cortx-data-node-6
..............................................................................................................................................................................
..............................................................................................................................................................................
..............................................................................................................................................................................
.............................................................................timed out waiting for the condition on deployments/cortx-data-node-5
timed out waiting for the condition on deployments/cortx-data-node-6

Deployment CORTX Data timed out after 1200 seconds

Failed.  Exiting script.

Here is the full pod listing:

-bash-4.2# kc get pods --all-namespaces -o wide
NAMESPACE            NAME                                      READY   STATUS    RESTARTS      AGE    IP               NODE     NOMINATED NODE   READINESS GATES
default              cortx-consul-client-m6w7q                 0/1     Running   0             68m    192.168.150.76   node-5   <none>           <none>
default              cortx-consul-client-zgdcm                 1/1     Running   0             68m    192.168.49.201   node-6   <none>           <none>
default              cortx-consul-server-0                     0/1     Running   1 (67m ago)   67m    192.168.49.202   node-6   <none>           <none>
default              cortx-consul-server-1                     0/1     Running   0             68m    192.168.150.75   node-5   <none>           <none>
default              cortx-control-5b5d458c47-7vh4z            1/1     Running   0             65m    192.168.49.204   node-6   <none>           <none>
default              cortx-data-node-5-84f68f846-9gcd9         3/3     Running   0             55m    192.168.150.78   node-5   <none>           <none>
default              cortx-data-node-6-78557bbd5-l7sql         3/3     Running   0             55m    192.168.49.206   node-6   <none>           <none>
default              cortx-kafka-0                             1/1     Running   2 (69m ago)   70m    192.168.49.198   node-6   <none>           <none>
default              cortx-kafka-1                             1/1     Running   2 (69m ago)   70m    192.168.150.70   node-5   <none>           <none>
default              cortx-zookeeper-0                         1/1     Running   0             70m    192.168.49.200   node-6   <none>           <none>
default              cortx-zookeeper-1                         1/1     Running   0             70m    192.168.150.74   node-5   <none>           <none>
kube-system          calico-kube-controllers-6b77fff45-fqh79   1/1     Running   1             111m   192.168.150.67   node-5   <none>           <none>
kube-system          calico-node-4w6nf                         1/1     Running   1             111m   10.52.3.120      node-5   <none>           <none>
kube-system          calico-node-wjmsh                         1/1     Running   0             92m    10.52.3.25       node-6   <none>           <none>
kube-system          coredns-64897985d-mlt6h                   1/1     Running   1             117m   192.168.150.65   node-5   <none>           <none>
kube-system          coredns-64897985d-mpnlf                   1/1     Running   1             117m   192.168.150.66   node-5   <none>           <none>
kube-system          etcd-node-5                               1/1     Running   1             117m   10.52.3.120      node-5   <none>           <none>
kube-system          kube-apiserver-node-5                     1/1     Running   1             117m   10.52.3.120      node-5   <none>           <none>
kube-system          kube-controller-manager-node-5            1/1     Running   1             117m   10.52.3.120      node-5   <none>           <none>
kube-system          kube-proxy-tgnwt                          1/1     Running   1             117m   10.52.3.120      node-5   <none>           <none>
kube-system          kube-proxy-vqv2s                          1/1     Running   0             92m    10.52.3.25       node-6   <none>           <none>
kube-system          kube-scheduler-node-5                     1/1     Running   1             117m   10.52.3.120      node-5   <none>           <none>
local-path-storage   local-path-provisioner-756898894-nhbcz    1/1     Running   0             70m    192.168.49.193   node-6   <none>           <none>
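
The Consul Pods above report 0/1 Ready; a generic way to see which readiness probe is failing is standard kubectl (not from the original thread):

kubectl describe pod cortx-consul-server-0 | tail -n 20
kubectl logs cortx-consul-server-0 --tail=20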

I wasn't sure whether rebooting after installing Calico helped; some problems clearly still persisted. I will keep trying, and I appreciate any suggestions!

Thanks in advance!

osowski commented 2 years ago

Checking out your updated deployment flow, the issue comes into play with the updated Calico deployment steps you were using. When you apply https://projectcalico.docs.tigera.io/manifests/custom-resources.yaml, it contains a setting that is wrong for most systems and is not the default applied in most Calico environments.

The encapsulation setting in that linked manifest is VXLANCrossSubnet, while the default (and most commonly used) setting is IPIP. To rectify this:

  1. I deleted the previous Installation (named default) and created a new one with encapsulation: IPIP instead.
  2. Once that was up and running, I restarted CoreDNS via kubectl rollout restart -n kube-system deployment/coredns.
  3. After that was back online, I deleted the failing Kafka and Consul client Pods, and everything came back online as expected! (A command-level sketch follows this list.)
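
In shell form, the recovery might look like the sketch below (the file name custom-resources-ipip.yaml and the Kafka label selector are assumptions, not from the thread):

# 1. Replace the Installation with one that uses IPIP encapsulation
kubectl delete installation default
kubectl apply -f custom-resources-ipip.yaml

# 2. Restart CoreDNS so it serves over the new Pod network
kubectl rollout restart -n kube-system deployment/coredns

# 3. Delete the crash-looping Pods so their controllers recreate them
kubectl delete pod -l app.kubernetes.io/name=kafka
kubectl delete pod cortx-consul-client-x6j8w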

Instead of applying the Calico defaults directly from the website, it would be better to use them as templates, tailor them to your specific environment, and reference the tailored files in your scripted installs. If you don't have a good place to host those files, you can use https://gist.github.com/ and reference them directly in your tutorial documents, as in the example below.
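
For example, a scripted install could pull a tailored manifest straight from a gist (the URL is a placeholder pattern, not a real gist):

curl -fsSL https://gist.githubusercontent.com/<user>/<gist-id>/raw/custom-resources.yaml | kubectl apply -f -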


Reference:

# This section includes base Calico installation configuration.
# For more information, see: https://projectcalico.docs.tigera.io/v3.23/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 192.168.0.0/16
      encapsulation: IPIP
      natOutgoing: Enabled
      nodeSelector: all()

---

# This section configures the Calico API server.
# For more information, see: https://projectcalico.docs.tigera.io/v3.23/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
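
After applying, the active encapsulation can be verified with something like this (a sketch; the jsonpath assumes a single ipPool, as in the manifest above):

kubectl get installation default -o jsonpath='{.spec.calicoNetwork.ipPools[0].encapsulation}'
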
cortx-admin commented 2 years ago

Patrick Hession commented in Jira Server:

.
