GoogleCloudPlatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos
Apache License 2.0

Validate the E2E workflow using https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb #317

Closed · zijianjoy closed this 2 years ago

zijianjoy commented 3 years ago

Spec

- Kubeflow: 1.4-rc.0
- Platform: GCP
- K8s: GKE 1.19.12-gke.2101
- Image type: Container-Optimized OS with Docker (cos)
- Library versions:

Status

Katib - the Launch experiment step hangs until it times out:

(screenshot: Katib launch-experiment step hanging)

cc @andreyvelich

andreyvelich commented 3 years ago

@zijianjoy Can you check the logs from the Trials' pods?
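
For reference, the Trial pod logs can typically be pulled like this; the namespace and `job-name` label are taken from the pod description later in this thread, so adjust them to your own experiment:

```shell
# List the Trial pods created by the experiment.
kubectl get pods -n namespace1 -l job-name=mnist-e2e-2-vdblgt8r

# Tail the training container's logs for the chief replica.
kubectl logs mnist-e2e-2-vdblgt8r-chief-0 -n namespace1 -c tensorflow --tail=100
```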

zijianjoy commented 3 years ago

Thank you @andreyvelich ,

The pods are in the Running state with the message `containers with unready status: [tensorflow]`. Some logs are as follows:


```
2021-09-20T14:32:32.575568Z  warn     xdsproxy      upstream terminated with unexpected error rpc error: code = Unknown desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data:
2021-09-20T14:32:32.576102Z  warning  envoy config  StreamAggregatedResources gRPC config stream closed: 2, closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data:
2021-09-20T14:32:32.984432Z  info     xdsproxy      connected to upstream XDS server: istiod-asm-1104-6.istio-system.svc:15012
2021-09-20T15:01:06.388170Z  warn     xdsproxy      upstream terminated with unexpected error rpc error: code = Unknown desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data:
2021-09-20T15:01:06.388622Z  warning  envoy config  StreamAggregatedResources gRPC config stream closed: 2, closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data:
2021-09-20T15:01:06.788749Z  info     xdsproxy      connected to upstream XDS server: istiod-asm-1104-6.istio-system.svc:15012
2021-09-20T15:18:18.863440Z  info     token         Prepared federated token request, aud: identitynamespace:jamxl-kfp-dev.svc.id.goog:https://container.googleapis.com/v1/projects/jamxl-kfp-dev/locations/us-west1-b/clusters/kf-ref20, STS endpoint: https://sts.googleapis.com/v1/token
2021-09-20T15:18:18.893881Z  info     token         Received federated token response after 30.358548ms
2021-09-20T15:18:18.894284Z  info     token         Federated token will expire in 3600 seconds
2021-09-20T15:18:18.894343Z  info     token         Prepared access token request
2021-09-20T15:18:18.932706Z  info     token         Received access token response after 38.347492ms
```
andreyvelich commented 3 years ago

Can you also describe this pod? `kubectl describe pod <pod-name> -n kubeflow-user-example-com`

zijianjoy commented 3 years ago

Sure thing! Here is the description for the pod:

kubectl describe pod mnist-e2e-2-vdblgt8r-chief-0 -n namespace1
Name:         mnist-e2e-2-vdblgt8r-chief-0
Namespace:    namespace1
Priority:     0
Node:         gke-kf-ref20-default-pool-e69057f9-8alv/10.138.0.33
Start Time:   Sun, 19 Sep 2021 23:33:12 +0000
Labels:       group-name=kubeflow.org
              istio.io/rev=asm-1104-6
              job-name=mnist-e2e-2-vdblgt8r
              job-role=master
              replica-index=0
              replica-type=chief
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=mnist-e2e-2-vdblgt8r-chief-0
              service.istio.io/canonical-revision=latest
              topology.istio.io/network=jamxl-kfp-dev-default
Annotations:  kubectl.kubernetes.io/default-container: tensorflow
              kubectl.kubernetes.io/default-logs-container: tensorflow
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token"],"imageP...
Status:       Running
IP:           10.84.3.40
IPs:
  IP:           10.84.3.40
Controlled By:  TFJob/mnist-e2e-2-vdblgt8r
Init Containers:
  istio-init:
    Container ID:  docker://e2abc8508805b6284951ffed2b506e8e226d838d262f75c019dba54592759fb4
    Image:         gcr.io/gke-release/asm/proxyv2:1.10.4-asm.6
    Image ID:      docker-pullable://gcr.io/gke-release/asm/proxyv2@sha256:011f852521d2859d711ddb0c8b93da5dfabd0ec79e8710f02ff05de278472617
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 19 Sep 2021 23:33:14 +0000
      Finished:     Sun, 19 Sep 2021 23:33:14 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     100m
      memory:  128Mi
    Environment:
      CA_PROVIDER:        GoogleCA
      GCE_METADATA_HOST:  metadata.google.internal
      GCP_METADATA:       jamxl-kfp-dev|1027430761074|kf-ref20|us-west1-b
      GKE_CLUSTER_URL:    https://container.googleapis.com/v1/projects/jamxl-kfp-dev/locations/us-west1-b/clusters/kf-ref20
      PLUGINS:            GoogleTokenExchange
      USE_TOKEN_FOR_CSR:  true
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fwll6 (ro)
Containers:
  tensorflow:
    Container ID:  docker://1caa0ff13f19aded56c27d5c754bad3b5ac571df14fd26dfc77820f30c9d99ba
    Image:         docker.io/liuhougangxa/tf-estimator-mnist
    Image ID:      docker-pullable://liuhougangxa/tf-estimator-mnist@sha256:00669e66921e9d93f981d354a066be2c7965f30bb646fcccb9ad621b419c03c8
    Port:          2222/TCP
    Host Port:     0/TCP
    Command:
      sh
      -c
    Args:
      python /opt/model.py --tf-train-steps=200 --tf-learning-rate=0.04915907370779257 --tf-batch-size=88 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 19 Sep 2021 23:33:18 +0000
      Finished:     Sun, 19 Sep 2021 23:33:50 +0000
    Ready:          False
    Restart Count:  0
    Environment:
      TF_CONFIG:  {"cluster":{"chief":["mnist-e2e-2-vdblgt8r-chief-0.namespace1.svc:2222"],"worker":["mnist-e2e-2-vdblgt8r-worker-0.namespace1.svc:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fwll6 (ro)
  istio-proxy:
    Container ID:  docker://34ab8ecfc9d7e5f7f286eee391d318ac255c4b0782e642ffb4c60367472f6119
    Image:         gcr.io/gke-release/asm/proxyv2:1.10.4-asm.6
    Image ID:      docker-pullable://gcr.io/gke-release/asm/proxyv2@sha256:011f852521d2859d711ddb0c8b93da5dfabd0ec79e8710f02ff05de278472617
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      mnist-e2e-2-vdblgt8r-chief-0.namespace1
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --stsPort=15463
      --concurrency
      2
    State:          Running
      Started:      Sun, 19 Sep 2021 23:33:19 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   128Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           kubernetes
      CA_ADDR:                       meshca.googleapis.com:443
      POD_NAME:                      mnist-e2e-2-vdblgt8r-chief-0 (v1:metadata.name)
      POD_NAMESPACE:                 namespace1 (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {"discoveryAddress":"istiod-asm-1104-6.istio-system.svc:15012","proxyMetadata":{"CA_PROVIDER":"GoogleCA","GCE_METADATA_HOST":"metadata.google.internal","GCP_METADATA":"jamxl-kfp-dev|1027430761074|kf-ref20|us-west1-b","GKE_CLUSTER_URL":"https://container.googleapis.com/v1/projects/jamxl-kfp-dev/locations/us-west1-b/clusters/kf-ref20","PLUGINS":"GoogleTokenExchange","USE_TOKEN_FOR_CSR":"true"},"meshId":"proj-1027430761074"}

      ISTIO_META_POD_PORTS:          [
                                         {"name":"tfjob-port","containerPort":2222,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     tensorflow
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      mnist-e2e-2-vdblgt8r-chief-0
      ISTIO_META_OWNER:              kubernetes://apis/v1/namespaces/namespace1/pods/mnist-e2e-2-vdblgt8r-chief-0
      ISTIO_META_MESH_ID:            proj-1027430761074
      TRUST_DOMAIN:                  jamxl-kfp-dev.svc.id.goog
      CA_PROVIDER:                   GoogleCA
      GCE_METADATA_HOST:             metadata.google.internal
      GCP_METADATA:                  jamxl-kfp-dev|1027430761074|kf-ref20|us-west1-b
      GKE_CLUSTER_URL:               https://container.googleapis.com/v1/projects/jamxl-kfp-dev/locations/us-west1-b/clusters/kf-ref20
      PLUGINS:                       GoogleTokenExchange
      USE_TOKEN_FOR_CSR:             true
      ISTIO_META_CLUSTER_ID:         cn-jamxl-kfp-dev-us-west1-b-kf-ref20
      ISTIO_META_NETWORK:            jamxl-kfp-dev-default
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fwll6 (ro)
      /var/run/secrets/tokens from istio-token (rw)
  metrics-logger-and-collector:
    Container ID:  docker://fa0f82026bf3373e788ac70a0cf533412edae0a2eabaa1deca8524b9d3bc4b3b
    Image:         docker.io/kubeflowkatib/file-metrics-collector:v0.12.0-rc.1
    Image ID:      docker-pullable://kubeflowkatib/file-metrics-collector@sha256:5870bd94383293af0c7baaa800e451a9ff9c8621cd565d4c09e76cf3371c7088
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      mnist-e2e-2-vdblgt8r
      -m
      loss
      -o-type
      minimize
      -s-db
      katib-db-manager.kubeflow:6789
      -path
      /var/log/katib/metrics.log
    State:          Running
      Started:      Sun, 19 Sep 2021 23:33:21 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fwll6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  default-token-fwll6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fwll6
    Optional:    false
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
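
One thing worth noting in the output above: the main `tensorflow` container is Terminated (Reason: Completed, Exit Code 0), while the injected `istio-proxy` sidecar is still Running, and the pod reports `Ready: False` / `ContainersReady: False`. A minimal, purely illustrative sketch of that readiness logic (not the actual kubelet or operator code; container names and states are taken from the describe output):

```python
# Illustrative sketch of why a pod whose training container has finished
# can still report Ready=False when long-running sidecars are injected.

def pod_ready(container_states):
    """Simplified: a pod is Ready only when every container is running/ready."""
    return all(state == "Running" for state in container_states.values())

def stuck_on_sidecars(container_states, main_containers):
    """True when the main work is done but sidecars keep the pod alive."""
    mains_done = all(container_states[c] == "Terminated" for c in main_containers)
    sidecars_running = any(
        state == "Running"
        for name, state in container_states.items()
        if name not in main_containers
    )
    return mains_done and sidecars_running

# States as shown in the describe output above.
states = {
    "tensorflow": "Terminated",                # Reason: Completed, Exit Code 0
    "istio-proxy": "Running",                  # injected Istio sidecar
    "metrics-logger-and-collector": "Running", # Katib metrics collector
}

print(pod_ready(states))                          # False
print(stuck_on_sidecars(states, ["tensorflow"]))  # True
```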
andreyvelich commented 3 years ago

For some reason, Istio sidecar injection has not been disabled for the TFJob training containers. I think it is related to this issue: https://github.com/kubeflow/tf-operator/issues/1403 cc @kubeflow/wg-training-leads
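
For context, the usual mechanism is the standard Istio annotation on the replica pod template, which the training operator is expected to set for training pods. A hypothetical excerpt of what the TFJob spec looks like with injection disabled (field values here are illustrative, based on the pod description above):

```yaml
# Hypothetical TFJob excerpt: the pod-template annotation that tells Istio
# not to inject a sidecar into the training pod.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-e2e
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: docker.io/liuhougangxa/tf-estimator-mnist
```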

Jeffwan commented 3 years ago

Em, let me have a look and try bumping the version first.

andreyvelich commented 3 years ago

@zijianjoy I guess https://github.com/kubeflow/manifests/pull/2032 should resolve this issue. Could you please try using the latest training operator controller image and run it again?
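
One way to swap in the newer controller image without reapplying the full manifests; the deployment/container name, namespace, and image tag here are assumptions based on the standalone training operator manifests and the version mentioned below, so verify them against your cluster:

```shell
# Point the training operator deployment at the newer image
# (names and tag are illustrative; check `kubectl get deploy -n kubeflow` first).
kubectl set image deployment/training-operator \
  training-operator=kubeflow/training-operator:v1.3.0-rc.2 \
  -n kubeflow

# Watch the rollout complete.
kubectl rollout status deployment/training-operator -n kubeflow
```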

zijianjoy commented 3 years ago

Thank you @Jeffwan and @andreyvelich ! I validated that the latest change to use training operator v1.3.0-rc.2 fixed the issue.

cc @kimwnasptd, can the release team pick this up for the new manifest release candidate? https://github.com/kubeflow/manifests/pull/2032 Thank you so much!