Closed: faradawn closed this issue 2 years ago
For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-29167. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.
Hi @faradawn , thanks for the detailed information. I am not sure of the problem, but I can advise on a few changes and request a bit more info if that doesn't work.
First, I recommend doing your initial deployment with a tagged release. The most recent release is v0.0.22. (I can see from the cortx images in solution.yaml that you are probably working on the integration branch.)
Second, prereq-deploy-cortx-cloud.sh expects a block device to be specified as a parameter, not a partition. (In fact, the prereq script creates a file system on the device and then mounts it.) You can run the prereq again with a whole device, or you can just make sure that /dev/sdb1 is mounted at /mnt/fs-local-file-system. (If it is already mounted somewhere else, update solution.yaml to point at that file system via solution>common>storage_provisioner_path.)
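The prereq behavior described above can be sketched as follows. This is a dry run that only prints the commands; the device name and mount point are assumptions taken from this thread (they match the lsblk output and solution.yaml later on), and the real commands must be run as root:

```shell
# Dry-run sketch: print the commands the prereq script effectively performs
# for a whole block device. Remove the 'echo' prefixes to actually execute.
DEV=/dev/sdb                 # assumption: a whole disk, not a partition like /dev/sdb1
MNT=/mnt/fs-local-volume     # assumption: matches solution>common>storage_provisioner_path

echo mkfs.ext4 "$DEV"        # create a file system on the raw device
echo mkdir -p "$MNT"
echo mount "$DEV" "$MNT"     # mount it where the local-path provisioner expects it
```

Verifying with findmnt (or mount | grep) that the device really is mounted at the storage_provisioner_path before re-running deploy is a cheap sanity check.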
If these changes don't help (and they might not), can you please post here whatever information you can get from kubectl logs kafka-0?
Thanks, Walter
Hi Walter @walterlopatka , Thanks so much for your careful reply! I made the following changes: 1) used the new solution.yaml from the main branch (I assumed it contained the most recent image releases?) and 2) passed a whole disk (instead of a partition) to the prereq script.
[Edit: I found that the Consul pod failed before Kafka, so I am looking into its log. Thinking it might be a port issue, I will open port 53 and ports 8000-9000 and retry!]
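Opening those ports with firewalld could look like the sketch below. It is a dry run that only prints the firewall-cmd invocations (the port list is the one mentioned above); removing the echo prefixes and running as root on every node applies them:

```shell
# Dry-run sketch: print the firewalld rules for DNS (53) and the 8000-9000 range.
# Remove the 'echo' prefixes to execute as root, then reload firewalld.
for rule in 53/tcp 53/udp 8000-9000/tcp; do
  echo firewall-cmd --permanent --add-port="$rule"
done
echo firewall-cmd --reload
```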
[root@master-node cc]# kubectl logs consul-server-1
==> Starting Consul agent...
Version: '1.10.0'
Node ID: 'f9fb533a-c52f-b4db-7a03-95e51471d14d'
Node name: 'consul-server-1'
Datacenter: 'dc1' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 10.32.0.4 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2022-03-09T05:19:15.930Z [WARN] agent: bootstrap_expect = 2: A cluster with 2 servers will provide no failure tolerance. See https://www.consul.io/docs/internals/consensus.html#deployment-table
2022-03-09T05:19:15.930Z [WARN] agent: bootstrap_expect > 0: expecting 2 servers
2022-03-09T05:19:16.013Z [WARN] agent.auto_config: bootstrap_expect = 2: A cluster with 2 servers will provide no failure tolerance. See https://www.consul.io/docs/internals/consensus.html#deployment-table
2022-03-09T05:19:16.013Z [WARN] agent.auto_config: bootstrap_expect > 0: expecting 2 servers
2022-03-09T05:19:16.124Z [INFO] agent.server.raft: initial configuration: index=0 servers=[]
2022-03-09T05:19:16.124Z [INFO] agent.server.raft: entering follower state: follower="Node at 10.32.0.4:8300 [Follower]" leader=
2022-03-09T05:19:16.125Z [INFO] agent.server.serf.wan: serf: EventMemberJoin: consul-server-1.dc1 10.32.0.4
2022-03-09T05:19:16.125Z [WARN] agent.server.serf.wan: serf: Failed to re-join any previously known node
2022-03-09T05:19:16.125Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: consul-server-1 10.32.0.4
2022-03-09T05:19:16.125Z [INFO] agent.router: Initializing LAN area manager
2022-03-09T05:19:16.125Z [WARN] agent.server.serf.lan: serf: Failed to re-join any previously known node
2022-03-09T05:19:16.125Z [INFO] agent.server: Adding LAN server: server="consul-server-1 (Addr: tcp/10.32.0.4:8300) (DC: dc1)"
2022-03-09T05:19:16.125Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.32.0.4:8300 0 consul-server-1 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.32.0.4:8300: operation was canceled". Reconnecting...
2022-03-09T05:19:16.125Z [INFO] agent.server: Handled event for server in area: event=member-join server=consul-server-1.dc1 area=wan
2022-03-09T05:19:16.126Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=tcp
2022-03-09T05:19:16.208Z [INFO] agent: Started DNS server: address=0.0.0.0:8600 network=udp
2022-03-09T05:19:16.209Z [INFO] agent: Starting server: address=[::]:8500 network=tcp protocol=http
2022-03-09T05:19:16.209Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2022-03-09T05:19:16.209Z [INFO] agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2022-03-09T05:19:16.209Z [INFO] agent: Joining cluster...: cluster=LAN
2022-03-09T05:19:16.209Z [INFO] agent: (LAN) joining: lan_addresses=[consul-server-0.consul-server.default.svc:8301, consul-server-1.consul-server.default.svc:8301]
2022-03-09T05:19:16.209Z [INFO] agent: started state syncer
2022-03-09T05:19:16.209Z [INFO] agent: Consul agent running!
2022-03-09T05:19:21.532Z [WARN] agent.server.raft: no known peers, aborting election
2022-03-09T05:19:23.488Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2022-03-09T05:19:26.212Z [WARN] agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-0.consul-server.default.svc:8301: lookup consul-server-0.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:59946->10.96.0.10:53: read: connection refused
2022-03-09T05:19:36.215Z [WARN] agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-1.consul-server.default.svc:8301: lookup consul-server-1.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:36283->10.96.0.10:53: read: connection refused
2022-03-09T05:19:36.215Z [WARN] agent: (LAN) couldn't join: number_of_nodes=0 error="2 errors occurred:
* Failed to resolve consul-server-0.consul-server.default.svc:8301: lookup consul-server-0.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:59946->10.96.0.10:53: read: connection refused
* Failed to resolve consul-server-1.consul-server.default.svc:8301: lookup consul-server-1.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:36283->10.96.0.10:53: read: connection refused
"
2022-03-09T05:19:36.215Z [WARN] agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=<nil>
2022-03-09T05:19:41.571Z [ERROR] agent: Failed to check for updates: error="Get "https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=f4981526-92a8-6a42-6c07-057b3243f162&version=1.10.0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2022-03-09T05:19:52.184Z [ERROR] agent: Coordinate update error: error="No cluster leader"
2022-03-09T05:19:58.052Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2022-03-09T05:20:06.216Z [INFO] agent: (LAN) joining: lan_addresses=[consul-server-0.consul-server.default.svc:8301, consul-server-1.consul-server.default.svc:8301]
2022-03-09T05:20:16.219Z [WARN] agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-0.consul-server.default.svc:8301: lookup consul-server-0.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:41387->10.96.0.10:53: read: connection refused
2022-03-09T05:20:23.288Z [ERROR] agent: Coordinate update error: error="No cluster leader"
2022-03-09T05:20:23.775Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2022-03-09T05:20:26.222Z [WARN] agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-1.consul-server.default.svc:8301: lookup consul-server-1.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:57400->10.96.0.10:53: read: connection refused
2022-03-09T05:20:26.222Z [WARN] agent: (LAN) couldn't join: number_of_nodes=0 error="2 errors occurred:
* Failed to resolve consul-server-0.consul-server.default.svc:8301: lookup consul-server-0.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:41387->10.96.0.10:53: read: connection refused
* Failed to resolve consul-server-1.consul-server.default.svc:8301: lookup consul-server-1.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:57400->10.96.0.10:53: read: connection refused
[root@faradawn-master k8_cortx_cloud]# ./deploy-cortx-cloud.sh solution.yaml
Validate solution file result: success
Number of worker nodes detected: 2
W0309 05:14:08.086968 12198 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0309 05:14:08.102833 12198 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: cortx-platform
LAST DEPLOYED: Wed Mar 9 05:14:07 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
"hashicorp" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "hashicorp" chart repository
Update Complete. ⎈Happy Helming!⎈
Install Rancher Local Path Provisionernamespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
######################################################
# Deploy Consul
######################################################
Error: INSTALLATION FAILED: timed out waiting for the condition
serviceaccount/consul-client patched
serviceaccount/consul-server patched
statefulset.apps/consul-server restarted
daemonset.apps/consul-client restarted
######################################################
# Deploy Zookeeper
######################################################
"bitnami" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Registry: ghcr.io
Repository: seagate/zookeeper
Tag: 3.7.0-debian-10-r182
NAME: zookeeper
LAST DEPLOYED: Wed Mar 9 05:19:16 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: zookeeper
CHART VERSION: 8.1.1
APP VERSION: 3.7.0
** Please be patient while the chart is being deployed **
ZooKeeper can be accessed via port 2181 on the following DNS name from within your cluster:
zookeeper.default.svc.cluster.local
To connect to your ZooKeeper server run the following commands:
export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=zookeeper,app.kubernetes.io/component=zookeeper" -o jsonpath="{.items[0].metadata.name}")
kubectl exec -it $POD_NAME -- zkCli.sh
To connect to your ZooKeeper server from outside the cluster execute the following commands:
kubectl port-forward --namespace default svc/zookeeper 2181: &
zkCli.sh 127.0.0.1:2181
Wait for Zookeeper to be ready before starting kafka
######################################################
# Deploy Kafka
######################################################
Registry: ghcr.io
Repository: seagate/kafka
Tag: 3.0.0-debian-10-r7
Error: INSTALLATION FAILED: timed out waiting for the condition
Wait for CORTX 3rd party to be ready...............................................................................................................................................
used two nodes this time
[root@faradawn-master k8_cortx_cloud]# kubectl get pods
NAME READY STATUS RESTARTS AGE
consul-client-5tfj5 0/1 Running 0 15m
consul-client-rk9c2 0/1 Running 0 15m
consul-server-0 0/1 Running 0 20m
consul-server-1 0/1 Running 0 15m
kafka-0 0/1 CrashLoopBackOff 7 (74s ago) 14m
kafka-1 0/1 CrashLoopBackOff 7 (2m14s ago) 14m
zookeeper-0 1/1 Running 0 15m
zookeeper-1 1/1 Running 0 15m
Kafka log
[root@faradawn-master k8_cortx_cloud]# kubectl logs kafka-0
kafka 05:33:09.27
kafka 05:33:09.27 Welcome to the Bitnami kafka container
kafka 05:33:09.27 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-kafka
kafka 05:33:09.27 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-kafka/issues
kafka 05:33:09.27
kafka 05:33:09.27 INFO ==> ** Starting Kafka setup **
kafka 05:33:09.34 WARN ==> You set the environment variable ALLOW_PLAINTEXT_LISTENER=yes. For safety reasons, do not use this flag in a production environment.
kafka 05:33:09.35 INFO ==> Initializing Kafka...
kafka 05:33:09.36 INFO ==> No injected configuration files found, creating default config files
kafka 05:33:09.67 INFO ==> Configuring Kafka for inter-broker communications with PLAINTEXT authentication.
kafka 05:33:09.68 WARN ==> Inter-broker communications are configured as PLAINTEXT. This is not safe for production environments.
kafka 05:33:09.68 INFO ==> Configuring Kafka for client communications with PLAINTEXT authentication.
kafka 05:33:09.69 WARN ==> Client communications are configured using PLAINTEXT listeners. For safety reasons, do not use this in a production environment.
kafka 05:33:09.70 INFO ==> ** Kafka setup finished! **
kafka 05:33:09.72 INFO ==> ** Starting Kafka **
[2022-03-09 05:33:11,185] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2022-03-09 05:33:11,884] INFO Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation (org.apache.zookeeper.common.X509Util)
[2022-03-09 05:33:12,054] INFO Registered signal handlers for TERM, INT, HUP (org.apache.kafka.common.utils.LoggingSignalHandler)
[2022-03-09 05:33:12,057] INFO starting (kafka.server.KafkaServer)
[2022-03-09 05:33:12,058] INFO Connecting to zookeeper on zookeeper.default.svc.cluster.local (kafka.server.KafkaServer)
[2022-03-09 05:33:12,072] INFO [ZooKeeperClient Kafka server] Initializing a new session to zookeeper.default.svc.cluster.local. (kafka.zookeeper.ZooKeeperClient)
[2022-03-09 05:33:12,076] INFO Client environment:zookeeper.version=3.6.3--6401e4ad2087061bc6b9f80dec2d69f2e3c8660a, built on 04/08/2021 16:35 GMT (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:host.name=kafka-0.kafka-headless.default.svc.cluster.local (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:java.version=11.0.12 (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:java.vendor=BellSoft (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:java.home=/opt/bitnami/java (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:java.class.path=/opt/bitnami/kafka/bin/../libs/activation-1.1.1.jar:/opt/bitnami/kafka/bin/../libs/aopalliance-repackaged-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/argparse4j-0.7.0.jar:/opt/bitnami/kafka/bin/../libs/audience-annotations-0.5.0.jar:/opt/bitnami/kafka/bin/../libs/commons-cli-1.4.jar:/opt/bitnami/kafka/bin/../libs/commons-lang3-3.8.1.jar:/opt/bitnami/kafka/bin/../libs/connect-api-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-basic-auth-extension-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-file-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-json-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-mirror-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-mirror-client-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-runtime-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-transforms-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/hk2-api-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/hk2-locator-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/hk2-utils-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/jackson-annotations-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-core-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-databind-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-dataformat-csv-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-datatype-jdk8-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-jaxrs-base-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-jaxrs-json-provider-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-module-jaxb-annotations-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-module-scala_2.12-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jakarta.activation-api-1.2.1.jar:/opt/bitnami/kafka/bin/../libs/jakarta.annotation-api-1.3.5.jar:/opt/bitnami/kafka/bin/../libs/jakarta.inject-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/jakarta.validation-api-2.0.2.jar:/opt/bitnami/kafka/bin/../libs/jakarta.ws.rs-api-2.1.6.jar:/opt/bitnami/kafka/bin/../libs/jakarta.xml.bind-api-2.3.2.jar:/opt/bitnami/kafka/bin/../libs/
javassist-3.27.0-GA.jar:/opt/bitnami/kafka/bin/../libs/javax.servlet-api-3.1.0.jar:/opt/bitnami/kafka/bin/../libs/javax.ws.rs-api-2.1.1.jar:/opt/bitnami/kafka/bin/../libs/jaxb-api-2.3.0.jar:/opt/bitnami/kafka/bin/../libs/jersey-client-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-common-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-container-servlet-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-container-servlet-core-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-hk2-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-server-2.34.jar:/opt/bitnami/kafka/bin/../libs/jetty-client-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-continuation-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-http-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-io-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-security-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-server-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-servlet-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-servlets-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-util-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-util-ajax-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jline-3.12.1.jar:/opt/bitnami/kafka/bin/../libs/jopt-simple-5.0.4.jar:/opt/bitnami/kafka/bin/../libs/kafka-clients-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-log4j-appender-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-metadata-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-raft-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-server-common-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-shell-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-storage-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-storage-api-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-examples-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-scala_2.12-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-test-utils-3.0.0.jar:/
opt/bitnami/kafka/bin/../libs/kafka-tools-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka_2.12-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/log4j-1.2.17.jar:/opt/bitnami/kafka/bin/../libs/lz4-java-1.7.1.jar:/opt/bitnami/kafka/bin/../libs/maven-artifact-3.8.1.jar:/opt/bitnami/kafka/bin/../libs/metrics-core-2.2.0.jar:/opt/bitnami/kafka/bin/../libs/metrics-core-4.1.12.1.jar:/opt/bitnami/kafka/bin/../libs/netty-buffer-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-codec-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-common-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-handler-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-resolver-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-transport-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-transport-native-epoll-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-transport-native-unix-common-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/osgi-resource-locator-1.0.3.jar:/opt/bitnami/kafka/bin/../libs/paranamer-2.8.jar:/opt/bitnami/kafka/bin/../libs/plexus-utils-3.2.1.jar:/opt/bitnami/kafka/bin/../libs/reflections-0.9.12.jar:/opt/bitnami/kafka/bin/../libs/rocksdbjni-6.19.3.jar:/opt/bitnami/kafka/bin/../libs/scala-collection-compat_2.12-2.4.4.jar:/opt/bitnami/kafka/bin/../libs/scala-java8-compat_2.12-1.0.0.jar:/opt/bitnami/kafka/bin/../libs/scala-library-2.12.14.jar:/opt/bitnami/kafka/bin/../libs/scala-logging_2.12-3.9.3.jar:/opt/bitnami/kafka/bin/../libs/scala-reflect-2.12.14.jar:/opt/bitnami/kafka/bin/../libs/slf4j-api-1.7.30.jar:/opt/bitnami/kafka/bin/../libs/slf4j-log4j12-1.7.30.jar:/opt/bitnami/kafka/bin/../libs/snappy-java-1.1.8.1.jar:/opt/bitnami/kafka/bin/../libs/trogdor-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/zookeeper-3.6.3.jar:/opt/bitnami/kafka/bin/../libs/zookeeper-jute-3.6.3.jar:/opt/bitnami/kafka/bin/../libs/zstd-jni-1.5.0-2.jar (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:java.io.tmpdir=/tmp (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:java.compiler=<NA> (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.name=Linux (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.arch=amd64 (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.version=3.10.0-1127.19.1.el7.x86_64 (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:user.name=? (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:user.home=? (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:user.dir=/ (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.memory.free=1009MB (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.memory.max=1024MB (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.memory.total=1024MB (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,079] INFO Initiating client connection, connectString=zookeeper.default.svc.cluster.local sessionTimeout=18000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@39a8312f (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,084] INFO jute.maxbuffer value is 4194304 Bytes (org.apache.zookeeper.ClientCnxnSocket)
[2022-03-09 05:33:12,089] INFO zookeeper.request.timeout value is 0. feature enabled=false (org.apache.zookeeper.ClientCnxn)
[2022-03-09 05:33:12,090] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2022-03-09 05:33:18,091] INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient)
[2022-03-09 05:33:32,108] ERROR Unable to resolve address: zookeeper.default.svc.cluster.local:2181 (org.apache.zookeeper.client.StaticHostProvider)
java.net.UnknownHostException: zookeeper.default.svc.cluster.local: Temporary failure in name resolution
at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)
at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519)
at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1509)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1368)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1302)
at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:88)
at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:141)
at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:368)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1207)
[2022-03-09 05:33:32,116] WARN An exception was thrown while closing send thread for session 0x0. (org.apache.zookeeper.ClientCnxn)
java.lang.IllegalArgumentException: Unable to canonicalize address zookeeper.default.svc.cluster.local:2181 because it's not resolvable
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:78)
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1161)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1210)
[2022-03-09 05:33:32,221] INFO Session: 0x0 closed (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:32,222] INFO EventThread shut down for session: 0x0 (org.apache.zookeeper.ClientCnxn)
[2022-03-09 05:33:32,223] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient)
[2022-03-09 05:33:32,226] ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:254)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:250)
at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:108)
at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1981)
at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:457)
at kafka.server.KafkaServer.startup(KafkaServer.scala:196)
at kafka.Kafka$.main(Kafka.scala:109)
at kafka.Kafka.main(Kafka.scala)
[2022-03-09 05:33:32,227] INFO shutting down (kafka.server.KafkaServer)
[2022-03-09 05:33:32,232] INFO App info kafka.server for 0 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-03-09 05:33:32,232] INFO shut down completed (kafka.server.KafkaServer)
[2022-03-09 05:33:32,233] ERROR Exiting Kafka. (kafka.Kafka$)
[2022-03-09 05:33:32,233] INFO shutting down (kafka.server.KafkaServer)
[root@faradawn-master k8_cortx_cloud]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.8T 0 disk
sdb 8:16 0 1.8T 0 disk /mnt/fs-local-volume
sdc 8:32 0 1.8T 0 disk
sdd 8:48 0 1.8T 0 disk
sde 8:64 0 1.8T 0 disk
sdf 8:80 0 1.8T 0 disk
sdg 8:96 0 1.8T 0 disk
sdh 8:112 0 1.8T 0 disk
sdi 8:128 0 1.8T 0 disk
sdj 8:144 0 1.8T 0 disk
sdk 8:160 0 1.8T 0 disk
sdl 8:176 0 1.8T 0 disk
sdm 8:192 0 1.8T 0 disk
sdn 8:208 0 1.8T 0 disk
sdo 8:224 0 1.8T 0 disk
sdp 8:240 0 1.8T 0 disk
sdq 65:0 0 372.6G 0 disk
└─sdq1 65:1 0 372.6G 0 part /
solution.yaml
solution:
namespace: default
secrets:
name: cortx-secret
content:
kafka_admin_secret: Seagate@123
consul_admin_secret: Seagate@123
common_admin_secret: Seagate@123
s3_auth_admin_secret: cortxadmin
csm_auth_admin_secret: seagate2
csm_mgmt_admin_secret: Cortxadmin@123
images:
cortxcontrol: ghcr.io/seagate/cortx-all:2.0.0-664
cortxdata: ghcr.io/seagate/cortx-all:2.0.0-664
cortxserver: ghcr.io/seagate/cortx-rgw:2.0.0-664
cortxha: ghcr.io/seagate/cortx-all:2.0.0-664
cortxclient: ghcr.io/seagate/cortx-all:2.0.0-664
consul: ghcr.io/seagate/consul:1.10.0
kafka: ghcr.io/seagate/kafka:3.0.0-debian-10-r7
zookeeper: ghcr.io/seagate/zookeeper:3.7.0-debian-10-r182
rancher: ghcr.io/seagate/local-path-provisioner:v0.0.20
busybox: ghcr.io/seagate/busybox:latest
common:
setup_size: large
storage_provisioner_path: /mnt/fs-local-volume
container_path:
local: /etc/cortx
shared: /share
log: /etc/cortx/log
s3:
default_iam_users:
auth_admin: "sgiamadmin"
auth_user: "user_name"
#auth_secret defined above in solution.secrets.content.s3_auth_admin_secret
num_inst: 2
start_port_num: 28051
max_start_timeout: 240
motr:
num_client_inst: 0
start_port_num: 29000
hax:
protocol: https
service_name: cortx-hax-svc
port_num: 22003
storage_sets:
name: storage-set-1
durability:
sns: 1+0+0
dix: 1+0+0
external_services:
s3:
type: NodePort
count: 1
ports:
http: 8000
https: 8443
nodePorts:
http: ""
https: ""
control:
type: NodePort
ports:
https: 8081
nodePorts:
https: ""
resource_allocation:
consul:
server:
storage: 10Gi
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 300Mi
cpu: 100m
client:
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 300Mi
cpu: 100m
zookeeper:
storage_request_size: 8Gi
data_log_dir_request_size: 8Gi
resources:
requests:
memory: 256Mi
cpu: 250m
limits:
memory: 512Mi
cpu: 500m
kafka:
storage_request_size: 8Gi
log_persistence_request_size: 8Gi
resources:
requests:
memory: 1Gi
cpu: 250m
limits:
memory: 2Gi
cpu: 1
storage:
cvg1:
name: cvg-01
type: ios
devices:
metadata:
device: /dev/sdc
size: 5Gi
data:
d1:
device: /dev/sdd
size: 5Gi
d2:
device: /dev/sde
size: 5Gi
nodes:
node1:
name: worker-node-1
node2:
name: worker-node-2
I know this is a lot of information! It's okay if it takes time -- I can keep trying!
You must be very busy -- I cannot thank you enough for taking a look!
If there is anything I can do, please let me know!
Thanks, Faradawn
Hi @faradawn , I'm sorry for the late response. I was out of the office on vacation last week.
From this error I see that the zookeeper service name is not resolving:
[2022-03-09 05:33:32,108] ERROR Unable to resolve address: zookeeper.default.svc.cluster.local:2181 (org.apache.zookeeper.client.StaticHostProvider)
java.net.UnknownHostException: zookeeper.default.svc.cluster.local: Temporary failure in name resolution
(I also see that the consul pods are not running.)
This is a symptom of problems with Kubernetes networking (though I am not sure that is the problem). I do not have deep expertise in Kubernetes networking / CNI. (I am using Calico and running on CentOS 7.9. What are you running?) My recent experience is that running Calico on Rocky Linux 8 has some problems, so I know that there can be issues between the OS and the K8s CNI.
I'm not sure what to advise yet, but I am curious about what OS you are running and how your k8s was deployed. You might consider some of the diagnostic steps described here if you haven't done any network diagnosis yet.
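One quick in-cluster check is to run a throwaway pod and try resolving the failing service name. The sketch below only composes the kubectl command as a string (the busybox image and the zookeeper service name are taken from the logs in this thread); if nslookup fails inside such a pod, the problem is CoreDNS/kube-dns or the CNI rather than Kafka or Consul themselves:

```shell
# Sketch: compose the kubectl command for an in-cluster DNS lookup.
# A failure here points at cluster DNS / CNI, not the application pods.
dns_check_cmd() {
  printf 'kubectl run dns-test --rm -i --image=busybox --restart=Never -- nslookup %s\n' "$1"
}
dns_check_cmd zookeeper.default.svc.cluster.local
```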
Hi Walter,
Thanks so much for the reply!
I was running on CentOS 7, using the Weave Net CNI.
As for how the k8s was deployed, here is an installation script that I ran:
echo -e '\n === Part1: install kubernetes and docker === \n'
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
yum check-update
echo y | yum install -y yum-utils device-mapper-persistent-data lvm2 firewalld docker kubelet kubeadm kubectl
systemctl enable docker && systemctl start docker
systemctl enable kubelet && systemctl start kubelet
echo -e '\n === Part2: configure firewall === \n'
cat <<EOF>> /etc/hosts
10.52.0.60 master-node
10.52.0.242 worker-node-1
10.52.3.14 worker-node-2
EOF
systemctl start firewalld
sudo firewall-cmd --permanent --add-port=6443/tcp
sudo firewall-cmd --permanent --add-port=2379-2380/tcp
sudo firewall-cmd --permanent --add-port=10250/tcp
sudo firewall-cmd --permanent --add-port=10251/tcp
sudo firewall-cmd --permanent --add-port=10252/tcp
sudo firewall-cmd --permanent --add-port=10255/tcp
sudo firewall-cmd --permanent --add-port=53-60000/tcp
sudo firewall-cmd --permanent --add-port=53-60000/udp
sudo firewall-cmd --reload
# update IP table
cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
# SElinx permissive mode
setenforce 0
sed -i --follow-symlinks 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/sysconfig/selinux
sed -i '/swap/d' /etc/fstab
swapoff -a
echo -e '\n === Part3: Kuber Init ===\n'
if [[ "$ME" == "master" ]]  # string comparison; -eq is for integers
then
kubeadm init
mkdir -p $HOME/.kube && cp -i /etc/kubernetes/admin.conf $HOME/.kube/config && chown $(id -u):$(id -g) $HOME/.kube/config
export kubever=$(kubectl version | base64 | tr -d '\n')
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$kubever"
fi
echo -e '\n === done! === \n'
Thanks for suggesting that the problem might relate to the CNI -- perhaps I can try Calico! I received the k8s diagnostic guide -- it was a great resource! I will look into it and try some debugging techniques!
Thanks, Faradawn
Hi Walter,
Progress was made -- owing to your suggestion! The following changes seemed to resolve the Consul and Kafka deployment issue:
1) Updated the nodes section in solution.yaml (might be a problem?)
2) Installed the path-provisioner on both the master and the worker node (might be a problem?)
Now, the deployment seemed to fail on deploying the CORTX data pods:
########################################################
# Deploy CORTX Data
########################################################
NAME: cortx-data-worker-node-1-default
LAST DEPLOYED: Fri Mar 18 04:34:42 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NAME: cortx-data-worker-node-2-default
LAST DEPLOYED: Fri Mar 18 04:34:43 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Wait for CORTX Data to be ready......................................................................timed out waiting for the condition on deployments/cortx-data-worker-node-1
timed out waiting for the condition on deployments/cortx-data-worker-node-2
.....................................................................................................................^C
[root@worker-node-1 k8_cortx_cloud]# kubectl get pods
NAME READY STATUS RESTARTS AGE
consul-client-5ss94 1/1 Running 0 15m
consul-server-0 1/1 Running 0 15m
cortx-control-85f5858cdb-lj4dj 1/1 Running 0 13m
cortx-data-worker-node-1-74f688c58d-56l2n 0/3 Pending 0 12m
cortx-data-worker-node-2-65b588468f-ntcst 0/3 Init:0/2 0 12m
kafka-0 1/1 Running 0 14m
openldap-0 1/1 Running 0 15m
zookeeper-0 1/1 Running 0 14m
[root@worker-node-1 k8_cortx_cloud]# kubectl logs cortx-data-worker-node-1-74f688c58d-56l2n
error: a container name must be specified for pod cortx-data-worker-node-1-74f688c58d-56l2n, choose one of: [cortx-hax cortx-motr-confd cortx-motr-io-001] or one of the init containers: [cortx-setup node-config]
I wondered: does it have to do with whether the control plane is included in the list of nodes in solution.yaml?
Here, worker-node-1 is the master (control plane). Do you think maybe I should exclude it (the control plane) from the list of nodes?
nodes:
node1:
name: worker-node-1
node2:
name: worker-node-2
[root@worker-node-1 k8_cortx_cloud]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
worker-node-1 Ready control-plane,master 171m v1.23.5
worker-node-2 Ready <none> 150m v1.23.5
Planning to test the following: excluding the master node from the nodes section in solution.yaml.
Thanks in advance!
Best, Faradawn
Hi @faradawn , great progress! I'm glad to see the 3rd party containers are starting up.
From your listing, it looks like your master node is tainted so that it will not schedule any pods. You can confirm like this:
kubectl describe node worker-node-1 | grep Taint
If it replies with something like Taints: node-role.kubernetes.io/master:NoSchedule
then it is tainted and will not schedule any pods. In a production environment it makes sense to separate the k8s master from the workers, but in test environments it's simpler and makes more equipment available to allow workers to be scheduled on the master.
You can remove the taint with:
kubectl taint node worker-node-1 node-role.kubernetes.io/master:NoSchedule-
(or replace node-role.kubernetes.io/master with whatever value you see in the above kubectl describe node | grep Taint output)
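To see how that check behaves without a live cluster, here is a small offline simulation of the grep against a sample line (the sample is an assumption of kubectl describe's output format, using the node name from this thread):

```shell
# Hypothetical: simulate `kubectl describe node worker-node-1 | grep Taint`
# against sample output (the real command needs a live cluster).
DESCRIBE_OUTPUT="Taints:             node-role.kubernetes.io/master:NoSchedule"
TAINT=$(echo "$DESCRIBE_OUTPUT" | grep Taint | awk '{print $2}')
echo "taint found: $TAINT"
# Removing it is the same string with a trailing '-':
#   kubectl taint node worker-node-1 "${TAINT}-"
```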
After that you should see two pods for each of the third-party services.
Another way that you can confirm this is by running kubectl describe pod cortx-data-worker-node-1-74f688c58d-56l2n
(the one that is pending), and the Events section will say something like "FailedScheduling" and something about a taint.
Best regards, Walter
Hi Walter,
Thanks so much for the information on taints! Excluding the master node from the list of nodes on which CORTX deploys data pods resolved the "data pod timeout" issue! Just to confirm, does the following output imply a successful deployment?
[root@master k8_cortx_cloud]# kubectl exec -it $DATA_POD -c cortx-hax -- /bin/bash -c "hctl status"
Byte_count:
critical_byte_count : 0
damaged_byte_count : 0
degraded_byte_count : 0
healthy_byte_count : 0
Data pool:
# fid name
0x6f00000000000001:0x23 'storage-set-1__sns'
Profile:
# fid name: pool(s)
0x7000000000000001:0x39 'Profile_the_pool': 'storage-set-1__sns' 'storage-set-1__dix' None
Services:
cortx-server-headless-svc-node-1
[started] hax 0x7200000000000001:0x1b inet:tcp:cortx-server-headless-svc-node-1@22001
[started] rgw 0x7200000000000001:0x1e inet:tcp:cortx-server-headless-svc-node-1@21501
cortx-data-headless-svc-node-1 (RC)
[started] hax 0x7200000000000001:0x6 inet:tcp:cortx-data-headless-svc-node-1@22001
[started] ioservice 0x7200000000000001:0x9 inet:tcp:cortx-data-headless-svc-node-1@21001
[started] confd 0x7200000000000001:0x16 inet:tcp:cortx-data-headless-svc-node-1@21002
[Edit: the deployment was successful. Can perform S3 IOs!]
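For anyone following along, a minimal S3 smoke-test sketch against such a deployment might look like this; the endpoint address and credentials below are placeholders, not values from this thread:

```shell
# Hypothetical endpoint and credentials -- substitute your deployment's values
# (the S3 credentials come from your CORTX configuration, not from this sketch).
ENDPOINT="http://<rgw-service-ip>:80"
BUCKET="test-bucket"
# A minimal round trip with the AWS CLI would be:
#   aws --endpoint-url "$ENDPOINT" s3 mb "s3://$BUCKET"
#   aws --endpoint-url "$ENDPOINT" s3 cp /etc/hosts "s3://$BUCKET/hosts"
#   aws --endpoint-url "$ENDPOINT" s3 ls "s3://$BUCKET"
echo "smoke test target: $ENDPOINT/$BUCKET"
```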
As for a summary of the solution:
1) Passed a whole disk (not a partition) to the prereq script: ./prereq-deploy-cortx-cloud.sh /dev/sdk
2) Opened the required firewall ports (via ufw or firewalld)
3) Cloned the main branch: git clone -b main https://github.com/Seagate/cortx-k8s
4) Excluded the master node from the nodes section in solution.yaml
5) Set setup_size: small, and tried one disk for metadata and two disks for data
6) Set csm_auth_admin_secret: seagate2!
Thanks so much, Walter, for helping me resolve this issue over the past 2 weeks!
If there is anything I could do, please let me know!
Best, Faradawn
Walter Lopatka commented in Jira Server:
NA
Walter Lopatka commented in Jira Server:
Closed in GitHub
[Edit: solution at the end of the thread]
To Whom It May Concern,
Error Description
When running the deploy-cortx-cloud.sh script, I kept getting the error that "Kafka installation failed: time out waiting for condition."
Crashed Pod Description
Here is a description of the crashed Kafka pod:
Disk layout
Repartitioned the disks and rebooted the server many times, but still couldn't get over the Kafka deployment issue. Wondered may I ask for some help on what the issue might be?
Below is my disk layout. I ran ./prereq-deploy-cortx-cloud.sh /dev/sdb1, with the disk parameter as /dev/sdb1.
Solution.yaml:
Sorry, I am a little new to this and had been trying for a few days. Any suggestion would help!
Thanks in advance!