awslabs / kubeflow-manifests

Kubeflow on AWS
https://awslabs.github.io/kubeflow-manifests/
Apache License 2.0

Kubeflow 1.4.1 on EKS 1.20 install failed #269

Closed · zorrofox closed this issue 2 years ago

zorrofox commented 2 years ago

Describe the bug

Some pods report a missing mysql-secret error, such as this:

Warning  Failed  54m (x256 over 109m)    kubelet  Error: secret "mysql-secret" not found
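
The missing secret can also be confirmed directly (a quick check, not part of the original report):

kubectl get secret -n kubeflow mysql-secret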

Steps To Reproduce

Follow the documentation to install Kubeflow.


surajkota commented 2 years ago

Hi @zorrofox, thanks for using Kubeflow on AWS. The aws-secrets-sync deployment is supposed to create the mysql-secret and ml-pipeline-minio-artifact secrets, and I see that the corresponding pod is not in Running status. Can you describe the pod to see if there is an error? Get the pod ID with the following command:

kubectl get pods -n kubeflow | grep "aws-secrets-sync"

Then use the pod ID to check its state and logs:

export POD_ID=<pod-id-here>
kubectl describe pod -n kubeflow $POD_ID
kubectl logs -n kubeflow $POD_ID -c <container-name>
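
If you are unsure which container name to pass to the logs command, you can list the containers in the pod spec (a generic kubectl query, not specific to this deployment):

kubectl get pod -n kubeflow $POD_ID -o jsonpath='{.spec.containers[*].name}'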
zorrofox commented 2 years ago

@surajkota thanks a lot for your help! Here is the describe output for the aws-secrets-sync pod:

kubectl describe -n kubeflow pod aws-secrets-sync-78bf8674fd-j8nxr
Name:           aws-secrets-sync-78bf8674fd-j8nxr
Namespace:      kubeflow
Priority:       0
Node:           ip-192-168-55-104.us-west-2.compute.internal/192.168.55.104
Start Time:     Wed, 22 Jun 2022 21:37:07 +0800
Labels:         app=aws-secrets-sync
                istio.io/rev=default
                pod-template-hash=78bf8674fd
                security.istio.io/tlsMode=istio
                service.istio.io/canonical-name=aws-secrets-sync
                service.istio.io/canonical-revision=latest
Annotations:    kubectl.kubernetes.io/default-logs-container: secrets
                kubernetes.io/psp: eks.privileged
                prometheus.io/path: /stats/prometheus
                prometheus.io/port: 15020
                prometheus.io/scrape: true
                sidecar.istio.io/status:
                  {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/aws-secrets-sync-78bf8674fd
Init Containers:
  istio-init:
    Container ID:  
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     10m
      memory:  40Mi
    Environment:
      AWS_DEFAULT_REGION:           us-west-2
      AWS_REGION:                   us-west-2
      AWS_ROLE_ARN:                 arn:aws:iam::975230531453:role/eksctl-kubeflow-workshop-addon-iamserviceacc-Role1-148SO2I187UW4
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-secrets-manager-sa-token-grl4m (ro)
Containers:
  secrets:
    Container ID:   
    Image:          public.ecr.aws/xray/aws-xray-daemon:latest
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      AWS_DEFAULT_REGION:           us-west-2
      AWS_REGION:                   us-west-2
      AWS_ROLE_ARN:                 arn:aws:iam::975230531453:role/eksctl-kubeflow-workshop-addon-iamserviceacc-Role1-148SO2I187UW4
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /mnt/rds-store from rds-secret (ro)
      /mnt/s3-store from s3-secret (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-secrets-manager-sa-token-grl4m (ro)
  istio-proxy:
    Container ID:  
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      aws-secrets-sync.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      aws-secrets-sync-78bf8674fd-j8nxr (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {"tracing":{}}

      ISTIO_META_POD_PORTS:          [
                                     ]
      ISTIO_META_APP_CONTAINERS:     secrets
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_METAJSON_ANNOTATIONS:    {"kubernetes.io/psp":"eks.privileged"}

      ISTIO_META_WORKLOAD_NAME:      aws-secrets-sync
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/aws-secrets-sync
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
      AWS_DEFAULT_REGION:            us-west-2
      AWS_REGION:                    us-west-2
      AWS_ROLE_ARN:                  arn:aws:iam::975230531453:role/eksctl-kubeflow-workshop-addon-iamserviceacc-Role1-148SO2I187UW4
      AWS_WEB_IDENTITY_TOKEN_FILE:   /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-secrets-manager-sa-token-grl4m (ro)
      /var/run/secrets/tokens from istio-token (rw)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  s3-secret:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            secrets-store.csi.k8s.io
    FSType:            
    ReadOnly:          true
    VolumeAttributes:      secretProviderClass=s3-secret
  rds-secret:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            secrets-store.csi.k8s.io
    FSType:            
    ReadOnly:          true
    VolumeAttributes:      secretProviderClass=rds-secret
  kubeflow-secrets-manager-sa-token-grl4m:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubeflow-secrets-manager-sa-token-grl4m
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  10m (x451 over 11h)    kubelet  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[s3-secret rds-secret], unattached volumes=[istio-token kubeflow-secrets-manager-sa-token-grl4m istiod-ca-cert aws-iam-token s3-secret istio-envoy istio-podinfo rds-secret istio-data]: timed out waiting for the condition
  Warning  FailedMount  3m52s (x250 over 11h)  kubelet  MountVolume.SetUp failed for volume "s3-secret" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name secrets-store.csi.k8s.io not found in the list of registered CSI drivers

None of the containers have produced any log output, and in the events I can see the error: driver name secrets-store.csi.k8s.io not found in the list of registered CSI drivers.

But the driver does show up when I list the CSI drivers:

kubectl get csidriver
NAME                       ATTACHREQUIRED   PODINFOONMOUNT   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
efs.csi.aws.com            false            false            <unset>         false               Persistent   41h
secrets-store.csi.k8s.io   false            true             <unset>         false               Ephemeral    23h
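
Note that kubectl get csidriver only shows the cluster-scoped CSIDriver object; the kubelet on each node registers the plugin separately via the driver DaemonSet's node pods, so the object can exist while the node running this pod still has no registered driver. One way to check which nodes actually have a driver pod, assuming the driver was installed into kube-system with its default app=secrets-store-csi-driver label:

kubectl get pods -n kube-system -l app=secrets-store-csi-driver -o wide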
surajkota commented 2 years ago

Can you please run through the troubleshooting steps in this post: https://aws.amazon.com/premiumsupport/knowledge-center/eks-troubleshoot-secrets-manager-issues/ and verify that the secrets-store-csi-driver pods are in Running state, and that the DaemonSet's DESIRED and CURRENT counts match the number of nodes in your cluster?
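
For example (a sketch, assuming the driver and the AWS provider were installed into kube-system under their default names and labels, which may differ in your setup):

kubectl get daemonset -n kube-system
kubectl get pods -n kube-system -l app=secrets-store-csi-driver
kubectl get pods -n kube-system -l app=csi-secrets-store-provider-aws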

zorrofox commented 2 years ago

Hi @surajkota ,

Thanks a lot for your help! I just found that the secret provider's YAML deployment had failed due to network issues.
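
For anyone hitting the same error: re-applying the AWS provider installer and then restarting the stuck pod should let the CSI mount succeed. A sketch, assuming the upstream installer URL below is still current:

# install (or re-install) the AWS provider DaemonSet
kubectl apply -f https://raw.githubusercontent.com/aws/secrets-store-csi-driver-provider-aws/main/deployment/aws-provider-installer.yaml
# confirm the provider pods come up on each node
kubectl get pods -n kube-system -l app=csi-secrets-store-provider-aws
# recreate the stuck pod so the volume mount is retried
kubectl rollout restart deployment -n kubeflow aws-secrets-sync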