Couldn't deploy Kafka during CORTX deployment

faradawn commented 2 years ago

[Edit: solution at the end of the thread]

To Whom It May Concern,

Error Description

When running the deploy-cortx-cloud.sh script, I kept getting the error that "Kafka installation failed: time out waiting for condition."

[root@master-node k8_cortx_cloud]# ./deploy-cortx-cloud.sh solution.yaml

Validate solution file result: success
Number of worker nodes detected: 1
W0302 14:56:09.990541    9500 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0302 14:56:10.007528    9500 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: cortx-platform
LAST DEPLOYED: Wed Mar  2 14:56:09 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
"hashicorp" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "hashicorp" chart repository
Update Complete. ⎈Happy Helming!⎈
Install Rancher Local Path Provisionernamespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
######################################################
# Deploy Consul                                       
######################################################
NAME: consul
LAST DEPLOYED: Wed Mar  2 14:56:11 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Thank you for installing HashiCorp Consul!

Your release is named consul.

To learn more about the release, run:

  $ helm status consul
  $ helm get all consul

Consul on Kubernetes Documentation:
https://www.consul.io/docs/platform/k8s

Consul on Kubernetes CLI Reference:
https://www.consul.io/docs/k8s/k8s-cli
serviceaccount/consul-client patched
serviceaccount/consul-server patched
statefulset.apps/consul-server restarted
daemonset.apps/consul-client restarted
######################################################
# Deploy openLDAP                                     
######################################################
NAME: openldap
LAST DEPLOYED: Wed Mar  2 14:56:36 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait for openLDAP PODs to be ready..............

===========================================================
Setup OpenLDAP replication                                 
===========================================================
######################################################
# Deploy Zookeeper                                    
######################################################
"bitnami" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈

Registry: ghcr.io
Repository: seagate/zookeeper
Tag: 3.7.0-debian-10-r182
NAME: zookeeper
LAST DEPLOYED: Wed Mar  2 14:56:57 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: zookeeper
CHART VERSION: 8.1.1
APP VERSION: 3.7.0

** Please be patient while the chart is being deployed **

ZooKeeper can be accessed via port 2181 on the following DNS name from within your cluster:

    zookeeper.default.svc.cluster.local

To connect to your ZooKeeper server run the following commands:

    export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=zookeeper,app.kubernetes.io/component=zookeeper" -o jsonpath="{.items[0].metadata.name}")
    kubectl exec -it $POD_NAME -- zkCli.sh

To connect to your ZooKeeper server from outside the cluster execute the following commands:

    kubectl port-forward --namespace default svc/zookeeper 2181: &
    zkCli.sh 127.0.0.1:2181

Wait for Zookeeper to be ready before starting kafka

######################################################
# Deploy Kafka                                        
######################################################

Registry: ghcr.io
Repository: seagate/kafka
Tag: 3.0.0-debian-10-r7
Error: INSTALLATION FAILED: timed out waiting for the condition

Wait for CORTX 3rd party to be ready.....................................................

Crashed Pod Description

Here is a description of the crashed Kafka pod:

[root@master-node cc]# kubectl get pod
NAME                  READY   STATUS             RESTARTS       AGE
consul-client-cs7b7   0/1     Running            0              3h8m
consul-server-0       1/1     Running            0              3h8m
kafka-0               0/1     CrashLoopBackOff   40 (44s ago)   3h7m
openldap-0            1/1     Running            0              3h8m
zookeeper-0           1/1     Running            0              3h7m
[root@master-node cc]# kubectl describe pod kafka
Name:         kafka-0
Namespace:    default
Priority:     0
Node:         worker-node-1/10.52.1.106
Start Time:   Wed, 02 Mar 2022 14:57:30 +0000
Labels:       app.kubernetes.io/component=kafka
              app.kubernetes.io/instance=kafka
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=kafka
              controller-revision-hash=kafka-866fd78b49
              helm.sh/chart=kafka-15.3.4
              statefulset.kubernetes.io/pod-name=kafka-0
Annotations:  <none>
Status:       Running
IP:           10.32.0.7
IPs:
  IP:           10.32.0.7
Controlled By:  StatefulSet/kafka
Containers:
  kafka:
    Container ID:  docker://fdd090e633af20142df15e3d69869c38317e654d37081b3c349e729e076c8563
    Image:         ghcr.io/seagate/kafka:3.0.0-debian-10-r7
    Image ID:      docker-pullable://ghcr.io/seagate/kafka@sha256:91155a01d7dc9de2e3909002b3c9fa308c8124d525de88e2acd55f1b95a8341d
    Ports:         9092/TCP, 9093/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /scripts/setup.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 02 Mar 2022 18:03:58 +0000
      Finished:     Wed, 02 Mar 2022 18:04:08 +0000
    Ready:          False
    Restart Count:  40
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:      250m
      memory:   1Gi
    Liveness:   tcp-socket :kafka-client delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:  tcp-socket :kafka-client delay=5s timeout=5s period=10s #success=1 #failure=6
    Environment:
      BITNAMI_DEBUG:                                       false
      MY_POD_IP:                                            (v1:status.podIP)
      MY_POD_NAME:                                         kafka-0 (v1:metadata.name)
      KAFKA_CFG_ZOOKEEPER_CONNECT:                         zookeeper.default.svc.cluster.local
      KAFKA_INTER_BROKER_LISTENER_NAME:                    INTERNAL
      KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP:            INTERNAL:PLAINTEXT,CLIENT:PLAINTEXT
      KAFKA_CFG_LISTENERS:                                 INTERNAL://:9093,CLIENT://:9092
      KAFKA_CFG_ADVERTISED_LISTENERS:                      INTERNAL://$(MY_POD_NAME).kafka-headless.default.svc.cluster.local:9093,CLIENT://$(MY_POD_NAME).kafka-headless.default.svc.cluster.local:9092
      ALLOW_PLAINTEXT_LISTENER:                            yes
      KAFKA_VOLUME_DIR:                                    /bitnami/kafka
      KAFKA_LOG_DIR:                                       /opt/bitnami/kafka/logs
      KAFKA_CFG_DELETE_TOPIC_ENABLE:                       true
      KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE:                 true
      KAFKA_HEAP_OPTS:                                     -Xmx1024m -Xms1024m
      KAFKA_CFG_LOG_FLUSH_INTERVAL_MESSAGES:               10000
      KAFKA_CFG_LOG_FLUSH_INTERVAL_MS:                     1000
      KAFKA_CFG_LOG_RETENTION_BYTES:                       1073741824
      KAFKA_CFG_LOG_RETENTION_CHECK_INTERVALS_MS:          300000
      KAFKA_CFG_LOG_RETENTION_HOURS:                       168
      KAFKA_CFG_MESSAGE_MAX_BYTES:                         1000012
      KAFKA_CFG_LOG_SEGMENT_BYTES:                         1073741824
      KAFKA_CFG_LOG_DIRS:                                  /bitnami/kafka/data
      KAFKA_CFG_DEFAULT_REPLICATION_FACTOR:                1
      KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR:          1
      KAFKA_CFG_TRANSACTION_STATE_LOG_REPLICATION_FACTOR:  1
      KAFKA_CFG_TRANSACTION_STATE_LOG_MIN_ISR:             2
      KAFKA_CFG_NUM_IO_THREADS:                            8
      KAFKA_CFG_NUM_NETWORK_THREADS:                       3
      KAFKA_CFG_NUM_PARTITIONS:                            1
      KAFKA_CFG_NUM_RECOVERY_THREADS_PER_DATA_DIR:         1
      KAFKA_CFG_SOCKET_RECEIVE_BUFFER_BYTES:               102400
      KAFKA_CFG_SOCKET_REQUEST_MAX_BYTES:                  104857600
      KAFKA_CFG_SOCKET_SEND_BUFFER_BYTES:                  102400
      KAFKA_CFG_ZOOKEEPER_CONNECTION_TIMEOUT_MS:           6000
      KAFKA_CFG_AUTHORIZER_CLASS_NAME:                     
      KAFKA_CFG_ALLOW_EVERYONE_IF_NO_ACL_FOUND:            true
      KAFKA_CFG_SUPER_USERS:                               User:admin
      KAFKA_CFG_LOG_SEGMENT_DELETE_DELAY_MS:               1000
      KAFKA_CFG_LOG_FLUSH_OFFSET_CHECKPOINT_INTERVAL_MS:   1000
      KAFKA_CFG_LOG_RETENTION_CHECK_INTERVAL_MS:           1000
    Mounts:
      /bitnami/kafka from data (rw)
      /opt/bitnami/kafka/logs from logs (rw)
      /scripts/setup.sh from scripts (rw,path="setup.sh")
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-kafka-0
    ReadOnly:   false
  scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kafka-scripts
    Optional:  false
  logs:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   29m (x35 over 3h8m)     kubelet  Container image "ghcr.io/seagate/kafka:3.0.0-debian-10-r7" already present on machine
  Warning  BackOff  4m19s (x847 over 3h8m)  kubelet  Back-off restarting failed container

Disk layout

Repartitioned the disks and rebooted the server many times, but still couldn't get over the Kafka deployment issue. Wondered may I ask for some help on what the issue might be?

Below is my disk layout, and I ran ./prereq-deploy-cortx-cloud.sh /dev/sdb1 with the disk parameter as /dev/sdb1.

[root@master-node cc]# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   1.8T  0 disk 
sdb      8:16   0   1.8T  0 disk 
└─sdb1   8:17   0   1.8T  0 part 
sdc      8:32   0   1.8T  0 disk 
└─sdc1   8:33   0   1.8T  0 part 
sdd      8:48   0   1.8T  0 disk 
└─sdd1   8:49   0   1.8T  0 part 
sde      8:64   0   1.8T  0 disk 
└─sde1   8:65   0   1.8T  0 part 
sdf      8:80   0   1.8T  0 disk 
sdg      8:96   0   1.8T  0 disk 
sdh      8:112  0   1.8T  0 disk 
sdi      8:128  0   1.8T  0 disk 
sdj      8:144  0   1.8T  0 disk 
sdk      8:160  0   1.8T  0 disk 
sdl      8:176  0   1.8T  0 disk 
sdm      8:192  0   1.8T  0 disk 
sdn      8:208  0   1.8T  0 disk 
sdo      8:224  0   1.8T  0 disk 
sdp      8:240  0   1.8T  0 disk 
sdq     65:0    0 372.6G  0 disk 
└─sdq1  65:1    0 372.6G  0 part /

Solution.yaml:

solution:
  namespace: default
  secrets:
    name: cortx-secret
    content:
      openldap_admin_secret: seagate1
      kafka_admin_secret: Seagate@123
      consul_admin_secret: Seagate@123
      common_admin_secret: Seagate@123
      s3_auth_admin_secret: ldapadmin
      csm_auth_admin_secret: seagate2
      csm_mgmt_admin_secret: Cortxadmin@123
  images:
    cortxcontrol: cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-2192-custom-ci
    cortxdata: cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-2192-custom-ci
    cortxserver: cortx-docker.colo.seagate.com/seagate/cortx-rgw:2.0.0-120-custom-ci
    cortxha: cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-2192-custom-ci
    cortxclient: cortx-docker.colo.seagate.com/seagate/cortx-all:2.0.0-2192-custom-ci
    openldap: ghcr.io/seagate/symas-openldap:2.4.58
    consul: ghcr.io/seagate/consul:1.10.0
    kafka: ghcr.io/seagate/kafka:3.0.0-debian-10-r7
    zookeeper: ghcr.io/seagate/zookeeper:3.7.0-debian-10-r182
    rancher: ghcr.io/seagate/local-path-provisioner:v0.0.20
    busybox: ghcr.io/seagate/busybox:latest
  common:
    setup_size: large
    storage_provisioner_path: /mnt/fs-local-volume
    container_path:
      local: /etc/cortx
      shared: /share
      log: /etc/cortx/log
    s3:
      default_iam_users:
        auth_admin: "sgiamadmin"
        auth_user: "user_name"
        #auth_secret defined above in solution.secrets.content.s3_auth_admin_secret
      num_inst: 2
      start_port_num: 28051
      max_start_timeout: 240
    motr:
      num_client_inst: 0
      start_port_num: 29000
    hax:
      protocol: https
      service_name: cortx-hax-svc
      port_num: 22003
    storage_sets:
      name: storage-set-1
      durability:
        sns: 1+0+0
        dix: 1+0+0
    external_services:
      type: LoadBalancer
    resource_allocation:
      consul:
        server:
          storage: 10Gi
          resources:
            requests:
              memory: 100Mi
              cpu: 100m
            limits:
              memory: 300Mi
              cpu: 100m
        client:
          resources:
            requests:
              memory: 100Mi
              cpu: 100m
            limits:
              memory: 300Mi
              cpu: 100m
      openldap:
        resources:
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 1Gi
            cpu: 2
      zookeeper:
        storage_request_size: 8Gi
        data_log_dir_request_size: 8Gi
        resources:
          requests:
            memory: 256Mi
            cpu: 250m
          limits:
            memory: 512Mi
            cpu: 500m
      kafka:
        storage_request_size: 8Gi
        log_persistence_request_size: 8Gi
        resources:
          requests:
            memory: 1Gi
            cpu: 250m
          limits:
            memory: 2Gi
            cpu: 1
  storage:
    cvg1:
      name: cvg-01
      type: ios
      devices:
        metadata:
          device: /dev/sdh
          size: 5Gi
        data:
          d1:
            device: /dev/sdi
            size: 5Gi
  nodes:
    node1:
      name: worker-node-1

Sorry I was a little new, and had been trying for a few days. Anything suggestion would help!

Thanks in advance!

cortx-admin commented 2 years ago

For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-29167. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.

walterlopatka commented 2 years ago

Hi @faradawn , thanks for the detailed information. I am not sure of the problem, but I can advise on a few changes and request a bit more info if that doesn't work.

First, I recommend doing your initial release with a labeled release. The most recent release is v0.0.22. (I can see from the cortx images in solution.yaml that you are probably working on the integration branch.)

Second, prereq-deploy-cortx-cloud.sh expects a block device to be specified a parameter, not a partition. (In fact, prereq attempts to create a file system on the device and then mounts it.) You can try that prereq again, or you can just make sure that /dev/sdb1 is mounted at /mnt/fs-local-file-system. (Or if already mounted somewhere else, then update solution.yaml to point at this file system (solution>common>storage_provisioner_path).

If these changes don't help (and they might not), can you please post here whatever information you can get from kubectl logs kafka-0?

Thanks, Walter

faradawn commented 2 years ago

Hi Walter @walterlopatka , Thanks so much for you careful reply! Made the following changes: 1) used the new solution.ymal from the main branch (I assumed it contained the most recent releases of images?) 2) passed a disk (instead of a partition) into the prereq script

[Edit: I found that Consul pod failed before Kafka, thus looking into its log. Thinking it might be port issue, will open the port 53 and 8000-9000 and retry!]

Consul Pod Log

[root@master-node cc]# kubectl logs consul-server-1 
==> Starting Consul agent...
           Version: '1.10.0'
           Node ID: 'f9fb533a-c52f-b4db-7a03-95e51471d14d'
         Node name: 'consul-server-1'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 10.32.0.4 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2022-03-09T05:19:15.930Z [WARN]  agent: bootstrap_expect = 2: A cluster with 2 servers will provide no failure tolerance. See https://www.consul.io/docs/internals/consensus.html#deployment-table
2022-03-09T05:19:15.930Z [WARN]  agent: bootstrap_expect > 0: expecting 2 servers
2022-03-09T05:19:16.013Z [WARN]  agent.auto_config: bootstrap_expect = 2: A cluster with 2 servers will provide no failure tolerance. See https://www.consul.io/docs/internals/consensus.html#deployment-table
2022-03-09T05:19:16.013Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 2 servers
2022-03-09T05:19:16.124Z [INFO]  agent.server.raft: initial configuration: index=0 servers=[]
2022-03-09T05:19:16.124Z [INFO]  agent.server.raft: entering follower state: follower="Node at 10.32.0.4:8300 [Follower]" leader=
2022-03-09T05:19:16.125Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-server-1.dc1 10.32.0.4
2022-03-09T05:19:16.125Z [WARN]  agent.server.serf.wan: serf: Failed to re-join any previously known node
2022-03-09T05:19:16.125Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: consul-server-1 10.32.0.4
2022-03-09T05:19:16.125Z [INFO]  agent.router: Initializing LAN area manager
2022-03-09T05:19:16.125Z [WARN]  agent.server.serf.lan: serf: Failed to re-join any previously known node
2022-03-09T05:19:16.125Z [INFO]  agent.server: Adding LAN server: server="consul-server-1 (Addr: tcp/10.32.0.4:8300) (DC: dc1)"
2022-03-09T05:19:16.125Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {10.32.0.4:8300 0 consul-server-1 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.32.0.4:8300: operation was canceled". Reconnecting...
2022-03-09T05:19:16.125Z [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-server-1.dc1 area=wan
2022-03-09T05:19:16.126Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=tcp
2022-03-09T05:19:16.208Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=udp
2022-03-09T05:19:16.209Z [INFO]  agent: Starting server: address=[::]:8500 network=tcp protocol=http
2022-03-09T05:19:16.209Z [WARN]  agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2022-03-09T05:19:16.209Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2022-03-09T05:19:16.209Z [INFO]  agent: Joining cluster...: cluster=LAN
2022-03-09T05:19:16.209Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-server-0.consul-server.default.svc:8301, consul-server-1.consul-server.default.svc:8301]
2022-03-09T05:19:16.209Z [INFO]  agent: started state syncer
2022-03-09T05:19:16.209Z [INFO]  agent: Consul agent running!
2022-03-09T05:19:21.532Z [WARN]  agent.server.raft: no known peers, aborting election
2022-03-09T05:19:23.488Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2022-03-09T05:19:26.212Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-0.consul-server.default.svc:8301: lookup consul-server-0.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:59946->10.96.0.10:53: read: connection refused
2022-03-09T05:19:36.215Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-1.consul-server.default.svc:8301: lookup consul-server-1.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:36283->10.96.0.10:53: read: connection refused
2022-03-09T05:19:36.215Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0 error="2 errors occurred:
    * Failed to resolve consul-server-0.consul-server.default.svc:8301: lookup consul-server-0.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:59946->10.96.0.10:53: read: connection refused
    * Failed to resolve consul-server-1.consul-server.default.svc:8301: lookup consul-server-1.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:36283->10.96.0.10:53: read: connection refused
"
2022-03-09T05:19:36.215Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=<nil>
2022-03-09T05:19:41.571Z [ERROR] agent: Failed to check for updates: error="Get "https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=f4981526-92a8-6a42-6c07-057b3243f162&version=1.10.0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2022-03-09T05:19:52.184Z [ERROR] agent: Coordinate update error: error="No cluster leader"
2022-03-09T05:19:58.052Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2022-03-09T05:20:06.216Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-server-0.consul-server.default.svc:8301, consul-server-1.consul-server.default.svc:8301]
2022-03-09T05:20:16.219Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-0.consul-server.default.svc:8301: lookup consul-server-0.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:41387->10.96.0.10:53: read: connection refused
2022-03-09T05:20:23.288Z [ERROR] agent: Coordinate update error: error="No cluster leader"
2022-03-09T05:20:23.775Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2022-03-09T05:20:26.222Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-server-1.consul-server.default.svc:8301: lookup consul-server-1.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:57400->10.96.0.10:53: read: connection refused
2022-03-09T05:20:26.222Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0 error="2 errors occurred:
    * Failed to resolve consul-server-0.consul-server.default.svc:8301: lookup consul-server-0.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:41387->10.96.0.10:53: read: connection refused
    * Failed to resolve consul-server-1.consul-server.default.svc:8301: lookup consul-server-1.consul-server.default.svc on 10.96.0.10:53: read udp 10.32.0.4:57400->10.96.0.10:53: read: connection refused

Deployment Error

[root@faradawn-master k8_cortx_cloud]# ./deploy-cortx-cloud.sh solution.yaml

Validate solution file result: success
Number of worker nodes detected: 2
W0309 05:14:08.086968   12198 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0309 05:14:08.102833   12198 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: cortx-platform
LAST DEPLOYED: Wed Mar  9 05:14:07 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
"hashicorp" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "hashicorp" chart repository
Update Complete. ⎈Happy Helming!⎈
Install Rancher Local Path Provisionernamespace/local-path-storage created
serviceaccount/local-path-provisioner-service-account created
clusterrole.rbac.authorization.k8s.io/local-path-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/local-path-provisioner-bind created
deployment.apps/local-path-provisioner created
storageclass.storage.k8s.io/local-path created
configmap/local-path-config created
######################################################
# Deploy Consul                                       
######################################################
Error: INSTALLATION FAILED: timed out waiting for the condition
serviceaccount/consul-client patched
serviceaccount/consul-server patched
statefulset.apps/consul-server restarted
daemonset.apps/consul-client restarted
######################################################
# Deploy Zookeeper                                    
######################################################
"bitnami" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈

Registry: ghcr.io
Repository: seagate/zookeeper
Tag: 3.7.0-debian-10-r182
NAME: zookeeper
LAST DEPLOYED: Wed Mar  9 05:19:16 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: zookeeper
CHART VERSION: 8.1.1
APP VERSION: 3.7.0

** Please be patient while the chart is being deployed **

ZooKeeper can be accessed via port 2181 on the following DNS name from within your cluster:

    zookeeper.default.svc.cluster.local

To connect to your ZooKeeper server run the following commands:

    export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=zookeeper,app.kubernetes.io/instance=zookeeper,app.kubernetes.io/component=zookeeper" -o jsonpath="{.items[0].metadata.name}")
    kubectl exec -it $POD_NAME -- zkCli.sh

To connect to your ZooKeeper server from outside the cluster execute the following commands:

    kubectl port-forward --namespace default svc/zookeeper 2181: &
    zkCli.sh 127.0.0.1:2181

Wait for Zookeeper to be ready before starting kafka

######################################################
# Deploy Kafka                                        
######################################################

Registry: ghcr.io
Repository: seagate/kafka
Tag: 3.0.0-debian-10-r7
Error: INSTALLATION FAILED: timed out waiting for the condition

Wait for CORTX 3rd party to be ready...............................................................................................................................................

Log of kafka pod

used two nodes this time

[root@faradawn-master k8_cortx_cloud]# kubectl get pods
NAME                  READY   STATUS             RESTARTS        AGE
consul-client-5tfj5   0/1     Running            0               15m
consul-client-rk9c2   0/1     Running            0               15m
consul-server-0       0/1     Running            0               20m
consul-server-1       0/1     Running            0               15m
kafka-0               0/1     CrashLoopBackOff   7 (74s ago)     14m
kafka-1               0/1     CrashLoopBackOff   7 (2m14s ago)   14m
zookeeper-0           1/1     Running            0               15m
zookeeper-1           1/1     Running            0               15m

Kafka log

[root@faradawn-master k8_cortx_cloud]# kubectl logs kafka-0
kafka 05:33:09.27 
kafka 05:33:09.27 Welcome to the Bitnami kafka container
kafka 05:33:09.27 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-kafka
kafka 05:33:09.27 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-kafka/issues
kafka 05:33:09.27 
kafka 05:33:09.27 INFO  ==> ** Starting Kafka setup **
kafka 05:33:09.34 WARN  ==> You set the environment variable ALLOW_PLAINTEXT_LISTENER=yes. For safety reasons, do not use this flag in a production environment.
kafka 05:33:09.35 INFO  ==> Initializing Kafka...
kafka 05:33:09.36 INFO  ==> No injected configuration files found, creating default config files
kafka 05:33:09.67 INFO  ==> Configuring Kafka for inter-broker communications with PLAINTEXT authentication.
kafka 05:33:09.68 WARN  ==> Inter-broker communications are configured as PLAINTEXT. This is not safe for production environments.
kafka 05:33:09.68 INFO  ==> Configuring Kafka for client communications with PLAINTEXT authentication.
kafka 05:33:09.69 WARN  ==> Client communications are configured using PLAINTEXT listeners. For safety reasons, do not use this in a production environment.
kafka 05:33:09.70 INFO  ==> ** Kafka setup finished! **
kafka 05:33:09.72 INFO  ==> ** Starting Kafka **
[2022-03-09 05:33:11,185] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2022-03-09 05:33:11,884] INFO Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation (org.apache.zookeeper.common.X509Util)
[2022-03-09 05:33:12,054] INFO Registered signal handlers for TERM, INT, HUP (org.apache.kafka.common.utils.LoggingSignalHandler)
[2022-03-09 05:33:12,057] INFO starting (kafka.server.KafkaServer)
[2022-03-09 05:33:12,058] INFO Connecting to zookeeper on zookeeper.default.svc.cluster.local (kafka.server.KafkaServer)
[2022-03-09 05:33:12,072] INFO [ZooKeeperClient Kafka server] Initializing a new session to zookeeper.default.svc.cluster.local. (kafka.zookeeper.ZooKeeperClient)
[2022-03-09 05:33:12,076] INFO Client environment:zookeeper.version=3.6.3--6401e4ad2087061bc6b9f80dec2d69f2e3c8660a, built on 04/08/2021 16:35 GMT (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:host.name=kafka-0.kafka-headless.default.svc.cluster.local (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:java.version=11.0.12 (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:java.vendor=BellSoft (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:java.home=/opt/bitnami/java (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,076] INFO Client environment:java.class.path=/opt/bitnami/kafka/bin/../libs/activation-1.1.1.jar:/opt/bitnami/kafka/bin/../libs/aopalliance-repackaged-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/argparse4j-0.7.0.jar:/opt/bitnami/kafka/bin/../libs/audience-annotations-0.5.0.jar:/opt/bitnami/kafka/bin/../libs/commons-cli-1.4.jar:/opt/bitnami/kafka/bin/../libs/commons-lang3-3.8.1.jar:/opt/bitnami/kafka/bin/../libs/connect-api-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-basic-auth-extension-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-file-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-json-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-mirror-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-mirror-client-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-runtime-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/connect-transforms-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/hk2-api-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/hk2-locator-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/hk2-utils-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/jackson-annotations-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-core-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-databind-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-dataformat-csv-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-datatype-jdk8-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-jaxrs-base-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-jaxrs-json-provider-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-module-jaxb-annotations-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jackson-module-scala_2.12-2.12.3.jar:/opt/bitnami/kafka/bin/../libs/jakarta.activation-api-1.2.1.jar:/opt/bitnami/kafka/bin/../libs/jakarta.annotation-api-1.3.5.jar:/opt/bitnami/kafka/bin/../libs/jakarta.inject-2.6.1.jar:/opt/bitnami/kafka/bin/../libs/jakarta.validation-api-2.0.2.jar:/opt/bitnami/kafka/bin/../libs/jakarta.ws.rs-api-2.1.6.jar:/opt/bitnami/kafka/bin/../libs/jakarta.xml.bind-api-2.3.2.jar:/opt/bitnami/kafka/bin/../libs/javassist-3.27.0-GA.jar:/opt/bitnami/kafka/bin/../libs/javax.servlet-api-3.1.0.jar:/opt/bitnami/kafka/bin/../libs/javax.ws.rs-api-2.1.1.jar:/opt/bitnami/kafka/bin/../libs/jaxb-api-2.3.0.jar:/opt/bitnami/kafka/bin/../libs/jersey-client-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-common-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-container-servlet-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-container-servlet-core-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-hk2-2.34.jar:/opt/bitnami/kafka/bin/../libs/jersey-server-2.34.jar:/opt/bitnami/kafka/bin/../libs/jetty-client-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-continuation-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-http-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-io-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-security-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-server-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-servlet-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-servlets-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-util-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jetty-util-ajax-9.4.43.v20210629.jar:/opt/bitnami/kafka/bin/../libs/jline-3.12.1.jar:/opt/bitnami/kafka/bin/../libs/jopt-simple-5.0.4.jar:/opt/bitnami/kafka/bin/../libs/kafka-clients-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-log4j-appender-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-metadata-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-raft-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-server-common-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-shell-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-storage-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-storage-api-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-examples-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-scala_2.12-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-streams-test-utils-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka-tools-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/kafka_2.12-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/log4j-1.2.17.jar:/opt/bitnami/kafka/bin/../libs/lz4-java-1.7.1.jar:/opt/bitnami/kafka/bin/../libs/maven-artifact-3.8.1.jar:/opt/bitnami/kafka/bin/../libs/metrics-core-2.2.0.jar:/opt/bitnami/kafka/bin/../libs/metrics-core-4.1.12.1.jar:/opt/bitnami/kafka/bin/../libs/netty-buffer-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-codec-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-common-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-handler-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-resolver-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-transport-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-transport-native-epoll-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/netty-transport-native-unix-common-4.1.62.Final.jar:/opt/bitnami/kafka/bin/../libs/osgi-resource-locator-1.0.3.jar:/opt/bitnami/kafka/bin/../libs/paranamer-2.8.jar:/opt/bitnami/kafka/bin/../libs/plexus-utils-3.2.1.jar:/opt/bitnami/kafka/bin/../libs/reflections-0.9.12.jar:/opt/bitnami/kafka/bin/../libs/rocksdbjni-6.19.3.jar:/opt/bitnami/kafka/bin/../libs/scala-collection-compat_2.12-2.4.4.jar:/opt/bitnami/kafka/bin/../libs/scala-java8-compat_2.12-1.0.0.jar:/opt/bitnami/kafka/bin/../libs/scala-library-2.12.14.jar:/opt/bitnami/kafka/bin/../libs/scala-logging_2.12-3.9.3.jar:/opt/bitnami/kafka/bin/../libs/scala-reflect-2.12.14.jar:/opt/bitnami/kafka/bin/../libs/slf4j-api-1.7.30.jar:/opt/bitnami/kafka/bin/../libs/slf4j-log4j12-1.7.30.jar:/opt/bitnami/kafka/bin/../libs/snappy-java-1.1.8.1.jar:/opt/bitnami/kafka/bin/../libs/trogdor-3.0.0.jar:/opt/bitnami/kafka/bin/../libs/zookeeper-3.6.3.jar:/opt/bitnami/kafka/bin/../libs/zookeeper-jute-3.6.3.jar:/opt/bitnami/kafka/bin/../libs/zstd-jni-1.5.0-2.jar (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:java.io.tmpdir=/tmp (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:java.compiler=<NA> (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.name=Linux (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.arch=amd64 (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.version=3.10.0-1127.19.1.el7.x86_64 (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:user.name=? (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:user.home=? (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:user.dir=/ (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.memory.free=1009MB (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.memory.max=1024MB (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,077] INFO Client environment:os.memory.total=1024MB (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,079] INFO Initiating client connection, connectString=zookeeper.default.svc.cluster.local sessionTimeout=18000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@39a8312f (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:12,084] INFO jute.maxbuffer value is 4194304 Bytes (org.apache.zookeeper.ClientCnxnSocket)
[2022-03-09 05:33:12,089] INFO zookeeper.request.timeout value is 0. feature enabled=false (org.apache.zookeeper.ClientCnxn)
[2022-03-09 05:33:12,090] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2022-03-09 05:33:18,091] INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient)
[2022-03-09 05:33:32,108] ERROR Unable to resolve address: zookeeper.default.svc.cluster.local:2181 (org.apache.zookeeper.client.StaticHostProvider)
java.net.UnknownHostException: zookeeper.default.svc.cluster.local: Temporary failure in name resolution
    at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)
    at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519)
    at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1509)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1368)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1302)
    at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:88)
    at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:141)
    at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:368)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1207)
[2022-03-09 05:33:32,116] WARN An exception was thrown while closing send thread for session 0x0. (org.apache.zookeeper.ClientCnxn)
java.lang.IllegalArgumentException: Unable to canonicalize address zookeeper.default.svc.cluster.local:2181 because it's not resolvable
    at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:78)
    at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
    at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1161)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1210)
[2022-03-09 05:33:32,221] INFO Session: 0x0 closed (org.apache.zookeeper.ZooKeeper)
[2022-03-09 05:33:32,222] INFO EventThread shut down for session: 0x0 (org.apache.zookeeper.ClientCnxn)
[2022-03-09 05:33:32,223] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient)
[2022-03-09 05:33:32,226] ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
    at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:254)
    at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:250)
    at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:108)
    at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1981)
    at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:457)
    at kafka.server.KafkaServer.startup(KafkaServer.scala:196)
    at kafka.Kafka$.main(Kafka.scala:109)
    at kafka.Kafka.main(Kafka.scala)
[2022-03-09 05:33:32,227] INFO shutting down (kafka.server.KafkaServer)
[2022-03-09 05:33:32,232] INFO App info kafka.server for 0 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-03-09 05:33:32,232] INFO shut down completed (kafka.server.KafkaServer)
[2022-03-09 05:33:32,233] ERROR Exiting Kafka. (kafka.Kafka$)
[2022-03-09 05:33:32,233] INFO shutting down (kafka.server.KafkaServer)

Disk Layout and Solution.yaml

[root@faradawn-master k8_cortx_cloud]# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   1.8T  0 disk 
sdb      8:16   0   1.8T  0 disk /mnt/fs-local-volume
sdc      8:32   0   1.8T  0 disk 
sdd      8:48   0   1.8T  0 disk 
sde      8:64   0   1.8T  0 disk 
sdf      8:80   0   1.8T  0 disk 
sdg      8:96   0   1.8T  0 disk 
sdh      8:112  0   1.8T  0 disk 
sdi      8:128  0   1.8T  0 disk 
sdj      8:144  0   1.8T  0 disk 
sdk      8:160  0   1.8T  0 disk 
sdl      8:176  0   1.8T  0 disk 
sdm      8:192  0   1.8T  0 disk 
sdn      8:208  0   1.8T  0 disk 
sdo      8:224  0   1.8T  0 disk 
sdp      8:240  0   1.8T  0 disk 
sdq     65:0    0 372.6G  0 disk 
└─sdq1  65:1    0 372.6G  0 part /

solution.ymal

solution:
  namespace: default
  secrets:
    name: cortx-secret
    content:
      kafka_admin_secret: Seagate@123
      consul_admin_secret: Seagate@123
      common_admin_secret: Seagate@123
      s3_auth_admin_secret: cortxadmin
      csm_auth_admin_secret: seagate2
      csm_mgmt_admin_secret: Cortxadmin@123
  images:
    cortxcontrol: ghcr.io/seagate/cortx-all:2.0.0-664
    cortxdata: ghcr.io/seagate/cortx-all:2.0.0-664
    cortxserver: ghcr.io/seagate/cortx-rgw:2.0.0-664
    cortxha: ghcr.io/seagate/cortx-all:2.0.0-664
    cortxclient: ghcr.io/seagate/cortx-all:2.0.0-664
    consul: ghcr.io/seagate/consul:1.10.0
    kafka: ghcr.io/seagate/kafka:3.0.0-debian-10-r7
    zookeeper: ghcr.io/seagate/zookeeper:3.7.0-debian-10-r182
    rancher: ghcr.io/seagate/local-path-provisioner:v0.0.20
    busybox: ghcr.io/seagate/busybox:latest
  common:
    setup_size: large
    storage_provisioner_path: /mnt/fs-local-volume
    container_path:
      local: /etc/cortx
      shared: /share
      log: /etc/cortx/log
    s3:
      default_iam_users:
        auth_admin: "sgiamadmin"
        auth_user: "user_name"
        #auth_secret defined above in solution.secrets.content.s3_auth_admin_secret
      num_inst: 2
      start_port_num: 28051
      max_start_timeout: 240
    motr:
      num_client_inst: 0
      start_port_num: 29000
    hax:
      protocol: https
      service_name: cortx-hax-svc
      port_num: 22003
    storage_sets:
      name: storage-set-1
      durability:
        sns: 1+0+0
        dix: 1+0+0
    external_services:
      s3:
        type: NodePort
        count: 1
        ports:
          http: 8000
          https: 8443
        nodePorts:
          http: ""
          https: ""
      control:
        type: NodePort
        ports:
          https: 8081
        nodePorts:
          https: ""
    resource_allocation:
      consul:
        server:
          storage: 10Gi
          resources:
            requests:
              memory: 100Mi
              cpu: 100m
            limits:
              memory: 300Mi
              cpu: 100m
        client:
          resources:
            requests:
              memory: 100Mi
              cpu: 100m
            limits:
              memory: 300Mi
              cpu: 100m
      zookeeper:
        storage_request_size: 8Gi
        data_log_dir_request_size: 8Gi
        resources:
          requests:
            memory: 256Mi
            cpu: 250m
          limits:
            memory: 512Mi
            cpu: 500m
      kafka:
        storage_request_size: 8Gi
        log_persistence_request_size: 8Gi
        resources:
          requests:
            memory: 1Gi
            cpu: 250m
          limits:
            memory: 2Gi
            cpu: 1
  storage:
    cvg1:
      name: cvg-01
      type: ios
      devices:
        metadata:
          device: /dev/sdc
          size: 5Gi
        data:
          d1:
            device: /dev/sdd
            size: 5Gi
          d2:
            device: /dev/sde
            size: 5Gi
  nodes:
    node1:
      name: worker-node-1
    node2:
      name: worker-node-2

It must be a lot of information! It's okay if it takes too much time -- I can keep trying!
You must be very busy -- cannot thank you enough if you would take a look!
If there is anything I could do, please let me know!

Thanks, Faradawn

walterlopatka commented 2 years ago

Hi @faradawn , I'm sorry for the late response. I was out of the office on vacation last week.

From this error I see that the zookeeper service name is not resolving:

[2022-03-09 05:33:32,108] ERROR Unable to resolve address: zookeeper.default.svc.cluster.local:2181 (org.apache.zookeeper.client.StaticHostProvider)
java.net.UnknownHostException: zookeeper.default.svc.cluster.local: Temporary failure in name resolution

(I also see that the consul pods are not running.)

This is a symptom of problems with Kubernetes networking (though I am not sure that is the problem). I do not have deep expertise in Kubernetes networking / CNI. I am using Calico and running on Centos 7.9. What are you running?) My recent experience is that running Calico on Rocky Linux 8 has some problems, so I know that there can be issues between OS and K8s CNI.

I'm not sure what to advise yet, but I am curious on what OS you are running, and how your k8s is deployed. You might consider some of the diagnostic steps described here if you haven't done any network diagnosis yet.

faradawn commented 2 years ago

Hi Walter,

Thanks so much for the reply!

I was running on Centos 7, using the Weavenet network.

As for how the k8s was deployed, here is an installation script that I ran:

echo -e '\n === Part1: install kubernetes and docker === \n'

cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF

yum check-update
echo y | yum install -y yum-utils device-mapper-persistent-data lvm2 firewalld docker kubelet kubeadm kubectl

systemctl enable docker && systemctl start docker
systemctl enable kubelet && systemctl start kubelet

echo -e '\n === Part2: configure firewall === \n'

cat <<EOF>> /etc/hosts
10.52.0.60 master-node
10.52.0.242 worker-node-1
10.52.3.14 worker-node-2
EOF
systemctl start firewalld
  sudo firewall-cmd --permanent --add-port=6443/tcp
  sudo firewall-cmd --permanent --add-port=2379-2380/tcp
  sudo firewall-cmd --permanent --add-port=10250/tcp
  sudo firewall-cmd --permanent --add-port=10251/tcp
  sudo firewall-cmd --permanent --add-port=10252/tcp
  sudo firewall-cmd --permanent --add-port=10255/tcp
  sudo firewall-cmd --permanent --add-port=53-60000/tcp
  sudo firewall-cmd --permanent --add-port=53-60000/udp
  sudo firewall-cmd --reload

# update IP table
cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
# SElinx permissive mode
setenforce 0
sed -i --follow-symlinks 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/sysconfig/selinux
sed -i '/swap/d' /etc/fstab
swapoff -a
echo -e '\n === Part3: Kuber Init ===\n'
if [[ $ME -eq "master" ]]
then
  kubeadm init
  mkdir -p $HOME/.kube && cp -i /etc/kubernetes/admin.conf $HOME/.kube/config && chown $(id -u):$(id -g) $HOME/.kube/config
  export kubever=$(kubectl version | base64 | tr -d '\n')
  kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$kubever"
fi
echo -e '\n === done! === \n'

Thanks for suggesting that the problem might relate to CNI, perhaps I can try Calio! Received the k8s diagnostic guide -- it was a great resource! Will look into it try some debugging techniques!

Thanks, Faradawn

faradawn commented 2 years ago

Hi Walter,

Progress was made -- owning to your suggestion! The following seemed to resolved Consul and Kafka deployment issue!

using main branch instead of integration
using Calio instead of Weavenet
disabling ufw firewall
changing deployment from big to small
including master in nodes in solution.yaml (might be a problem?)
different disks for path-provisioner on master and the worker node (might be a problem?)

Now, the deployment seemed to fail on deploying CORTX data pods:

########################################################
# Deploy CORTX Data                                     
########################################################
NAME: cortx-data-worker-node-1-default
LAST DEPLOYED: Fri Mar 18 04:34:42 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NAME: cortx-data-worker-node-2-default
LAST DEPLOYED: Fri Mar 18 04:34:43 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait for CORTX Data to be ready......................................................................timed out waiting for the condition on deployments/cortx-data-worker-node-1
timed out waiting for the condition on deployments/cortx-data-worker-node-2
.....................................................................................................................^C

[root@worker-node-1 k8_cortx_cloud]# kubectl get pods
NAME                                        READY   STATUS     RESTARTS   AGE
consul-client-5ss94                         1/1     Running    0          15m
consul-server-0                             1/1     Running    0          15m
cortx-control-85f5858cdb-lj4dj              1/1     Running    0          13m
cortx-data-worker-node-1-74f688c58d-56l2n   0/3     Pending    0          12m
cortx-data-worker-node-2-65b588468f-ntcst   0/3     Init:0/2   0          12m
kafka-0                                     1/1     Running    0          14m
openldap-0                                  1/1     Running    0          15m
zookeeper-0                                 1/1     Running    0          14m

[root@worker-node-1 k8_cortx_cloud]# kubectl logs cortx-data-worker-node-1-74f688c58d-56l2n
error: a container name must be specified for pod cortx-data-worker-node-1-74f688c58d-56l2n, choose one of: [cortx-hax cortx-motr-confd cortx-motr-io-001] or one of the init containers: [cortx-setup node-config]

I wondered does it have to do with whether to include the control plane in the list of nodes in solution.yaml? Here, the worker-node-1 is the master (control plane). Do you think maybe I should exclude it (control plane) from the list of nodes?

  nodes:
    node1:
      name: worker-node-1
    node2:
      name: worker-node-2

[root@worker-node-1 k8_cortx_cloud]# kubectl get nodes
NAME            STATUS   ROLES                  AGE    VERSION
worker-node-1   Ready    control-plane,master   171m   v1.23.5
worker-node-2   Ready    <none>                 150m   v1.23.5

Planning test the following:

exclude master node in the nodes section in solution.yaml
allocate only two disks for storage pods instead of three
keep deployment size as small

Thanks in advance!

Best, Faradawn

walterlopatka commented 2 years ago

Hi @faradawn , great progress! I'm glad to see the 3rd party containers are starting up.

From you listing, it looks like your master node is tainted so that it will not schedule any pods. You can confirm like this:

kubectl describe node worker-node-1 | grep Taint

If it replies with something like Taints: Taints: node-role.kubernetes.io/master:NoSchedule then it is tainted and will not schedule any pods. In a production environment it makes sense to separate the k8s master from the workers, but in test environments it's simpler and makes more equipment available to allow workers to be scheduled on the master.

You can remove the taint with:

kubectl taint node worker-node-1 node-role.kubernetes.io/master:NoSchedule-

(or replace node-role.kubernetes.io/master with whatever value you see in the above kubectl describe node | grep Taint output)

After that you should see two pods for each of the third party sw.

Another way that you can confirm this is by running kubectl describe pod cortx-data-worker-node-1-74f688c58d-56l2n (the one that is pending), and the Events section will say something like "FailedScheduling" and something about a taint.

Best regards, Walter

faradawn commented 2 years ago

Hi Walter,

Thanks for much for the information on taints! Excluding the master node from the list of node on which CORTX deploy data pods resolved the "data pod timeout" issue! Just to confirm, does the follow output implies a successful deployment?

[root@master k8_cortx_cloud]# kubectl exec -it $DATA_POD -c cortx-hax -- /bin/bash -c "hctl status"
Byte_count:
    critical_byte_count : 0
    damaged_byte_count : 0
    degraded_byte_count : 0
    healthy_byte_count : 0
Data pool:
    # fid name
    0x6f00000000000001:0x23 'storage-set-1__sns'
Profile:
    # fid name: pool(s)
    0x7000000000000001:0x39 'Profile_the_pool': 'storage-set-1__sns' 'storage-set-1__dix' None
Services:
    cortx-server-headless-svc-node-1 
    [started]  hax        0x7200000000000001:0x1b  inet:tcp:cortx-server-headless-svc-node-1@22001
    [started]  rgw        0x7200000000000001:0x1e  inet:tcp:cortx-server-headless-svc-node-1@21501
    cortx-data-headless-svc-node-1  (RC)
    [started]  hax        0x7200000000000001:0x6   inet:tcp:cortx-data-headless-svc-node-1@22001
    [started]  ioservice  0x7200000000000001:0x9   inet:tcp:cortx-data-headless-svc-node-1@21001
    [started]  confd      0x7200000000000001:0x16  inet:tcp:cortx-data-headless-svc-node-1@21002

[Edit: the deployment was successful. Can perform S3 IOs!]

As for a summary of the issue:

1 - Installing Kubernetes

two CC-CentOS7-2003, one master, one worker
used Calio as CNI, such as following this guide
removed RAID, LVM, or partitions on the disks for CORTX to deploy pods on
passed a disk instead of partition to prereq script: ./prereq-deploy-cortx-cloud.sh /dev/sdk

2 - Resolving Kafka failure

was due to previous Consul pod's unsuccessful installation
opened the required ports for Consul. Or disable ufw or firewalld

3 - Resolving CORTX deployment issue

checked out the main branch for the latestest cortx images: git clone -b main https://github.com/Seagate/cortx-k8s
didn't include master node in the nodes section in solution.yaml
can try setup_size: small, and try one disk for metadata and two disks for data
can try adding special character to csm_auth_admin_secret: seagate2!

Thanks so much, Walter, for resolving the issue for me for the past 2 weeks!

If there is anything I could do, please let me know!

Best, Faradawn

cortx-admin commented 2 years ago

Walter Lopatka commented in Jira Server:

NA

cortx-admin commented 2 years ago

Walter Lopatka commented in Jira Server:

Closed in GitHub

cortx-admin commented 2 years ago

Walter Lopatka commented in Jira Server:

Closed in GitHub

Seagate / cortx-k8s