Closed zijianjoy closed 2 years ago
@zijianjoy Can you check the logs from the Trials' pods ?
Thank you @andreyvelich ,
The pods show running state with message containers with unready status: [tensorflow]
. Some logs are as followed:
Info
2021-09-20 07:32:32.575 PDT2021-09-20T14:32:32.575568Z warn xdsproxy upstream terminated with unexpected error rpc error: code = Unknown desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data:
Error
2021-09-20 07:32:32.576 PDT2021-09-20T14:32:32.576102Z warning envoy config StreamAggregatedResources gRPC config stream closed: 2, closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data:
Info
2021-09-20 07:32:32.984 PDT2021-09-20T14:32:32.984432Z info xdsproxy connected to upstream XDS server: istiod-asm-1104-6.istio-system.svc:15012
Info
2021-09-20 08:01:06.388 PDT2021-09-20T15:01:06.388170Z warn xdsproxy upstream terminated with unexpected error rpc error: code = Unknown desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data:
Error
2021-09-20 08:01:06.388 PDT2021-09-20T15:01:06.388622Z warning envoy config StreamAggregatedResources gRPC config stream closed: 2, closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data:
Info
2021-09-20 08:01:06.788 PDT2021-09-20T15:01:06.788749Z info xdsproxy connected to upstream XDS server: istiod-asm-1104-6.istio-system.svc:15012
Info
2021-09-20 08:18:18.863 PDT2021-09-20T15:18:18.863440Z info token Prepared federated token request, aud: identitynamespace:jamxl-kfp-dev.svc.id.goog:https://container.googleapis.com/v1/projects/jamxl-kfp-dev/locations/us-west1-b/clusters/kf-ref20, STS endpoint: https://sts.googleapis.com/v1/token
Info
2021-09-20 08:18:18.894 PDT2021-09-20T15:18:18.893881Z info token Received federated token response after 30.358548ms
Info
2021-09-20 08:18:18.894 PDT2021-09-20T15:18:18.894284Z info token Federated token will expire in 3600 seconds
Info
2021-09-20 08:18:18.894 PDT2021-09-20T15:18:18.894343Z info token Prepared access token request
Info
2021-09-20 08:18:18.932 PDT2021-09-20T15:18:18.932706Z info token Received access token response after 38.347492ms
Can you also describe
this pod ?
kubectl describe pod <pod-name> -n kubeflow-user-example-com
Sure thing! getting the log for pod:
k describe pod mnist-e2e-2-vdblgt8r-chief-0 -n namespace1
Name: mnist-e2e-2-vdblgt8r-chief-0
Namespace: namespace1
Priority: 0
Node: gke-kf-ref20-default-pool-e69057f9-8alv/10.138.0.33
Start Time: Sun, 19 Sep 2021 23:33:12 +0000
Labels: group-name=kubeflow.org
istio.io/rev=asm-1104-6
job-name=mnist-e2e-2-vdblgt8r
job-role=master
replica-index=0
replica-type=chief
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=mnist-e2e-2-vdblgt8r-chief-0
service.istio.io/canonical-revision=latest
topology.istio.io/network=jamxl-kfp-dev-default
Annotations: kubectl.kubernetes.io/default-container: tensorflow
kubectl.kubernetes.io/default-logs-container: tensorflow
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/status:
{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token"],"imageP...
Status: Running
IP: 10.84.3.40
IPs:
IP: 10.84.3.40
Controlled By: TFJob/mnist-e2e-2-vdblgt8r
Init Containers:
istio-init:
Container ID: docker://e2abc8508805b6284951ffed2b506e8e226d838d262f75c019dba54592759fb4
Image: gcr.io/gke-release/asm/proxyv2:1.10.4-asm.6
Image ID: docker-pullable://gcr.io/gke-release/asm/proxyv2@sha256:011f852521d2859d711ddb0c8b93da5dfabd0ec79e8710f02ff05de278472617
Port: <none>
Host Port: <none>
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
-b
*
-d
15090,15021,15020
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 19 Sep 2021 23:33:14 +0000
Finished: Sun, 19 Sep 2021 23:33:14 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 100m
memory: 128Mi
Environment:
CA_PROVIDER: GoogleCA
GCE_METADATA_HOST: metadata.google.internal
GCP_METADATA: jamxl-kfp-dev|1027430761074|kf-ref20|us-west1-b
GKE_CLUSTER_URL: https://container.googleapis.com/v1/projects/jamxl-kfp-dev/locations/us-west1-b/clusters/kf-ref20
PLUGINS: GoogleTokenExchange
USE_TOKEN_FOR_CSR: true
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-fwll6 (ro)
Containers:
tensorflow:
Container ID: docker://1caa0ff13f19aded56c27d5c754bad3b5ac571df14fd26dfc77820f30c9d99ba
Image: docker.io/liuhougangxa/tf-estimator-mnist
Image ID: docker-pullable://liuhougangxa/tf-estimator-mnist@sha256:00669e66921e9d93f981d354a066be2c7965f30bb646fcccb9ad621b419c03c8
Port: 2222/TCP
Host Port: 0/TCP
Command:
sh
-c
Args:
python /opt/model.py --tf-train-steps=200 --tf-learning-rate=0.04915907370779257 --tf-batch-size=88 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 19 Sep 2021 23:33:18 +0000
Finished: Sun, 19 Sep 2021 23:33:50 +0000
Ready: False
Restart Count: 0
Environment:
TF_CONFIG: {"cluster":{"chief":["mnist-e2e-2-vdblgt8r-chief-0.namespace1.svc:2222"],"worker":["mnist-e2e-2-vdblgt8r-worker-0.namespace1.svc:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}
Mounts:
/var/log/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-fwll6 (ro)
istio-proxy:
Container ID: docker://34ab8ecfc9d7e5f7f286eee391d318ac255c4b0782e642ffb4c60367472f6119
Image: gcr.io/gke-release/asm/proxyv2:1.10.4-asm.6
Image ID: docker-pullable://gcr.io/gke-release/asm/proxyv2@sha256:011f852521d2859d711ddb0c8b93da5dfabd0ec79e8710f02ff05de278472617
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--serviceCluster
mnist-e2e-2-vdblgt8r-chief-0.namespace1
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
--stsPort=15463
--concurrency
2
State: Running
Started: Sun, 19 Sep 2021 23:33:19 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 100m
memory: 128Mi
Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
Environment:
JWT_POLICY: third-party-jwt
PILOT_CERT_PROVIDER: kubernetes
CA_ADDR: meshca.googleapis.com:443
POD_NAME: mnist-e2e-2-vdblgt8r-chief-0 (v1:metadata.name)
POD_NAMESPACE: namespace1 (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
CANONICAL_SERVICE: (v1:metadata.labels['service.istio.io/canonical-name'])
CANONICAL_REVISION: (v1:metadata.labels['service.istio.io/canonical-revision'])
PROXY_CONFIG: {"discoveryAddress":"istiod-asm-1104-6.istio-system.svc:15012","proxyMetadata":{"CA_PROVIDER":"GoogleCA","GCE_METADATA_HOST":"metadata.google.internal","GCP_METADATA":"jamxl-kfp-dev|1027430761074|kf-ref20|us-west1-b","GKE_CLUSTER_URL":"https://container.googleapis.com/v1/projects/jamxl-kfp-dev/locations/us-west1-b/clusters/kf-ref20","PLUGINS":"GoogleTokenExchange","USE_TOKEN_FOR_CSR":"true"},"meshId":"proj-1027430761074"}
ISTIO_META_POD_PORTS: [
{"name":"tfjob-port","containerPort":2222,"protocol":"TCP"}
]
ISTIO_META_APP_CONTAINERS: tensorflow
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_META_WORKLOAD_NAME: mnist-e2e-2-vdblgt8r-chief-0
ISTIO_META_OWNER: kubernetes://apis/v1/namespaces/namespace1/pods/mnist-e2e-2-vdblgt8r-chief-0
ISTIO_META_MESH_ID: proj-1027430761074
TRUST_DOMAIN: jamxl-kfp-dev.svc.id.goog
CA_PROVIDER: GoogleCA
GCE_METADATA_HOST: metadata.google.internal
GCP_METADATA: jamxl-kfp-dev|1027430761074|kf-ref20|us-west1-b
GKE_CLUSTER_URL: https://container.googleapis.com/v1/projects/jamxl-kfp-dev/locations/us-west1-b/clusters/kf-ref20
PLUGINS: GoogleTokenExchange
USE_TOKEN_FOR_CSR: true
ISTIO_META_CLUSTER_ID: cn-jamxl-kfp-dev-us-west1-b-kf-ref20
ISTIO_META_NETWORK: jamxl-kfp-dev-default
Mounts:
/etc/istio/pod from istio-podinfo (rw)
/etc/istio/proxy from istio-envoy (rw)
/var/lib/istio/data from istio-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-fwll6 (ro)
/var/run/secrets/tokens from istio-token (rw)
metrics-logger-and-collector:
Container ID: docker://fa0f82026bf3373e788ac70a0cf533412edae0a2eabaa1deca8524b9d3bc4b3b
Image: docker.io/kubeflowkatib/file-metrics-collector:v0.12.0-rc.1
Image ID: docker-pullable://kubeflowkatib/file-metrics-collector@sha256:5870bd94383293af0c7baaa800e451a9ff9c8621cd565d4c09e76cf3371c7088
Port: <none>
Host Port: <none>
Args:
-t
mnist-e2e-2-vdblgt8r
-m
loss
-o-type
minimize
-s-db
katib-db-manager.kubeflow:6789
-path
/var/log/katib/metrics.log
State: Running
Started: Sun, 19 Sep 2021 23:33:21 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
Requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
Environment: <none>
Mounts:
/var/log/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-fwll6 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
limits.cpu -> cpu-limit
requests.cpu -> cpu-request
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
default-token-fwll6:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-fwll6
Optional: false
metrics-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
For some reason, Istio sidecar annotation has not been disabled on the TFJob training containers. I think it is related to this issue: https://github.com/kubeflow/tf-operator/issues/1403 cc @kubeflow/wg-training-leads
em. Let me have a check and try to bump the version first.
@zijianjoy I guess https://github.com/kubeflow/manifests/pull/2032 should resolve this issue, please can you try to use the latest training operator controller image and run it again.
Thank you @Jeffwan and @andreyvelich ! I validated that the latest change to use training operator v1.3.0-rc.2 fixed the issue.
cc @kimwnasptd , can release team pick this up for new manifest release candidate? https://github.com/kubeflow/manifests/pull/2032 Thank you so much!
Spec
Kubeflow 1.4-rc.0 Platform: GCP K8s: GKE 1.19.12-gke.2101 Image type: Container-Optimized OS with Docker (cos) Library versions:
Status
Katib - Launch experiment step is hanging until time out:
cc @andreyvelich