Seagate / cortx-k8s

CORTX Kubernetes Orchestration Repository
https://github.com/Seagate/cortx
Apache License 2.0

[Solved] cortx server pod failed during deployment #355

Closed: faradawn closed this issue 8 months ago

faradawn commented 2 years ago

Problem

cortx-server-0 and cortx-data-g0-0 failed to initialize during deployment.

Expected behavior

The deployment should complete successfully. [See the new comment below]

How to reproduce

Follow the commands in this guide: https://github.com/faradawn/tutorials/blob/main/linux/cortx/README.md

CORTX on Kubernetes version

main, v0.10.0
cortx_k8s commit id: 7a920e1a967a99b561952aa584b8658abeb99b4f
Date: Mon Aug 15 17:02:52 2022 -0700

Deployment information

Kubernetes version: v1.24
kubectl version: v1.24

Solution configuration file YAML

solution:
  namespace: default
  deployment_type: standard
  secrets:
    name: cortx-secret
    content:
      kafka_admin_secret: null
      consul_admin_secret: null
      common_admin_secret: null
      s3_auth_admin_secret: null
      csm_auth_admin_secret: null
      csm_mgmt_admin_secret: Cortx123!
  images:
    cortxcontrol: ghcr.io/seagate/cortx-control:2.0.0-895
    cortxdata: ghcr.io/seagate/cortx-data:2.0.0-895
    cortxserver: ghcr.io/seagate/cortx-rgw:2.0.0-895
    cortxha: ghcr.io/seagate/cortx-control:2.0.0-895
    cortxclient: ghcr.io/seagate/cortx-data:2.0.0-895
    consul: ghcr.io/seagate/consul:1.11.4
    kafka: ghcr.io/seagate/kafka:3.0.0-debian-10-r97
    zookeeper: ghcr.io/seagate/zookeeper:3.8.0-debian-10-r9
    rancher: ghcr.io/seagate/local-path-provisioner:v0.0.20
    busybox: ghcr.io/seagate/busybox:latest
  common:
    storage_provisioner_path: /mnt/fs-local-volume
    s3:
      default_iam_users:
        auth_admin: "sgiamadmin"
        auth_user: "user_name"
        #auth_secret defined above in solution.secrets.content.s3_auth_admin_secret
      max_start_timeout: 240
      instances_per_node: 1
      extra_configuration: ""
    motr:
      num_client_inst: 0
      extra_configuration: ""
    hax:
      protocol: https
      port_num: 22003
    external_services:
      s3:
        type: NodePort
        count: 1
        ports:
          http: 80
          https: 443
        nodePorts:
          http: null
          https: null
      control:
        type: NodePort
        ports:
          https: 8081
        nodePorts:
          https: null
    resource_allocation:
      consul:
        server:
          storage: 10Gi
          resources:
            requests:
              memory: 200Mi
              cpu: 200m
            limits:
              memory: 500Mi
              cpu: 500m
        client:
          resources:
            requests:
              memory: 200Mi
              cpu: 200m
            limits:
              memory: 500Mi
              cpu: 500m
      zookeeper:
        storage_request_size: 8Gi
        data_log_dir_request_size: 8Gi
        resources:
          requests:
            memory: 256Mi
            cpu: 250m
          limits:
            memory: 512Mi
            cpu: 500m
      kafka:
        storage_request_size: 8Gi
        resources:
          requests:
            memory: 1Gi
            cpu: 250m
          limits:
            memory: 2Gi
            cpu: 1000m
      hare:
        hax:
          resources:
            requests:
              memory: 128Mi
              cpu: 250m
            limits:
              memory: 2Gi
              cpu: 1000m
      data:
        motr:
          resources:
            requests:
              memory: 1Gi
              cpu: 250m
            limits:
              memory: 2Gi
              cpu: 1000m
        confd:
          resources:
            requests:
              memory: 128Mi
              cpu: 250m
            limits:
              memory: 512Mi
              cpu: 500m
      server:
        rgw:
          resources:
            requests:
              memory: 128Mi
              cpu: 250m
            limits:
              memory: 2Gi
              cpu: 2000m
      control:
        agent:
          resources:
            requests:
              memory: 128Mi
              cpu: 250m
            limits:
              memory: 256Mi
              cpu: 500m
      ha:
        fault_tolerance:
          resources:
            requests:
              memory: 128Mi
              cpu: 250m
            limits:
              memory: 1Gi
              cpu: 500m
        health_monitor:
          resources:
            requests:
              memory: 128Mi
              cpu: 250m
            limits:
              memory: 1Gi
              cpu: 500m
        k8s_monitor:
          resources:
            requests:
              memory: 128Mi
              cpu: 250m
            limits:
              memory: 1Gi
              cpu: 500m
  storage_sets:
    - name: storage-set-1
      durability:
        sns: 1+0+0
        dix: 1+0+0
      container_group_size: 1
      nodes:
        - sky-2.novalocal
      storage:
        - name: cvg-01
          type: ios
          devices:
            metadata:
              - path: /dev/loop1
                size: 5Gi
            data:
              - path: /dev/loop2
                size: 5Gi

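Note that the storage set above references /dev/loop1 and /dev/loop2. If those loop devices are not already set up on the node, here is a minimal sketch for creating them from backing files (the backing-file paths are assumptions; the sizes match the 5Gi entries in the config):

# create sparse backing files (paths chosen for illustration)
sudo truncate -s 5G /var/tmp/cortx-meta.img
sudo truncate -s 5G /var/tmp/cortx-data.img
# attach them to the loop devices named in the solution file
sudo losetup /dev/loop1 /var/tmp/cortx-meta.img
sudo losetup /dev/loop2 /var/tmp/cortx-data.img
# verify both devices are attached
losetup -l
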
Logs

Get all pods

[cc@sky-2 k8_cortx_cloud]$ kc get pods -A -o wide
NAMESPACE            NAME                                      READY   STATUS     RESTARTS      AGE   IP            NODE              NOMINATED NODE   READINESS GATES
default              cortx-consul-client-926b4                 1/1     Running    0             14m   10.85.0.26    sky-2.novalocal   <none>           <none>
default              cortx-consul-server-0                     1/1     Running    0             14m   10.85.0.32    sky-2.novalocal   <none>           <none>
default              cortx-control-f4b57d4dd-c486j             1/1     Running    0             14m   10.85.0.25    sky-2.novalocal   <none>           <none>
default              cortx-data-g0-0                           0/3     Init:0/2   0             14m   <none>        sky-2.novalocal   <none>           <none>
default              cortx-ha-56fb4b495-ptrps                  3/3     Running    0             14m   10.85.0.35    sky-2.novalocal   <none>           <none>
default              cortx-kafka-0                             1/1     Running    0             14m   10.85.0.37    sky-2.novalocal   <none>           <none>
default              cortx-server-0                            0/2     Init:0/1   0             14m   10.85.0.33    sky-2.novalocal   <none>           <none>
default              cortx-zookeeper-0                         1/1     Running    0             14m   10.85.0.36    sky-2.novalocal   <none>           <none>
kube-system          coredns-5769f8787-l4vzb                   1/1     Running    0             61s   10.85.0.38    sky-2.novalocal   <none>           <none>
kube-system          coredns-5769f8787-rrqxn                   1/1     Running    0             61s   10.85.0.39    sky-2.novalocal   <none>           <none>
kube-system          etcd-sky-2.novalocal                      1/1     Running    0             74m   10.52.2.232   sky-2.novalocal   <none>           <none>
kube-system          kube-apiserver-sky-2.novalocal            1/1     Running    1 (27m ago)   74m   10.52.2.232   sky-2.novalocal   <none>           <none>
kube-system          kube-controller-manager-sky-2.novalocal   1/1     Running    5 (16m ago)   74m   10.52.2.232   sky-2.novalocal   <none>           <none>
kube-system          kube-proxy-ksjjr                          1/1     Running    0             73m   10.52.2.232   sky-2.novalocal   <none>           <none>
kube-system          kube-scheduler-sky-2.novalocal            1/1     Running    6 (16m ago)   73m   10.52.2.232   sky-2.novalocal   <none>           <none>
local-path-storage   local-path-provisioner-7f45fdfb8-86rz6    1/1     Running    0             54m   10.85.0.4     sky-2.novalocal   <none>           <none>

Describe the server pod

[cc@sky-2 k8_cortx_cloud]$ kc describe pod  cortx-server-0
Name:         cortx-server-0
Namespace:    default
Priority:     0
Node:         sky-2.novalocal/10.52.2.232

Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-cortx-server-0
    ReadOnly:   false
  cortx-configuration:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cortx
    Optional:  false
  cortx-ssl-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cortx-ssl-cert
    Optional:  false
  configuration-secrets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cortx-secret
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  14m   default-scheduler  Successfully assigned default/cortx-server-0 to sky-2.novalocal
  Normal  Pulled     14m   kubelet            Container image "ghcr.io/seagate/cortx-rgw:2.0.0-895" already present on machine
  Normal  Created    14m   kubelet            Created container cortx-setup
  Normal  Started    14m   kubelet            Started container cortx-setup

Tried to get logs

[cc@sky-2 k8_cortx_cloud]$ kc logs cortx-server-0
Defaulted container "cortx-hax" out of: cortx-hax, cortx-rgw, cortx-setup (init)
Error from server (BadRequest): container "cortx-hax" in pod "cortx-server-0" is waiting to start: PodInitializing
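Since the pod is stuck in PodInitializing, the app containers have no logs yet; the init container's output can be pulled directly with the -c flag (the container name cortx-setup comes from the pod description above):

kubectl logs cortx-server-0 -c cortx-setup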

Additional information

cortx-admin commented 2 years ago

For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-33974. Note that community members will not be able to access that Jira server, but that is not a problem, since all activity in that Jira mirror will be copied into this GitHub issue.

faradawn commented 2 years ago

Hi,

Just resolved the issue by not applying the Calico network plugin! So the problem might be Calico preventing CORTX's server pod from initializing (maybe!)
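
A quick way to confirm whether Calico is running on the cluster (a sketch, assuming the k8s-app=calico-node label used by the standard Calico manifests):

kubectl get pods -n kube-system -l k8s-app=calico-node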

Here is the deployment detail:

Thanks, and can close this issue anytime!

hessio commented 2 years ago

Hi @faradawn, thanks for this update! Do you think we should create an issue about the Calico network plugin?

faradawn commented 2 years ago

Hi @hessio,

Just tried again and found that the Calico issue could also be resolved by restarting the CoreDNS pods before the CORTX deployment!

kubectl rollout restart -n kube-system deployment/coredns

Maybe the above command could be added to the troubleshooting guide, for example:

If the Calico network plugin was applied and the deployment failed due to "timeout waiting for cortx-server or cortx-data-g0 to initialize", try the following:

kubectl rollout restart -n kube-system deployment/coredns

If there is anything I can do, please let me know!
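
One small related check (a sketch, standard kubectl): wait for the CoreDNS rollout to finish before starting the CORTX deployment, so the restart has actually completed:

kubectl rollout status -n kube-system deployment/coredns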

hessio commented 2 years ago

Thanks a lot for finding this issue!

shailesh-vaidya commented 8 months ago

Closing as obsolete.