zijianjoy opened this issue 3 years ago
KFServing step has failed
Error message:
{'apiVersion': 'serving.kubeflow.org/v1beta1', 'kind': 'InferenceService', 'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'}, 'creationTimestamp': '2021-05-11T21:34:33Z', 'finalizers': ['inferenceservice.finalizers'], 'generation': 1, 'managedFields': [{'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:sidecar.istio.io/inject': {}}}, 'f:spec': {'.': {}, 'f:predictor': {'.': {}, 'f:tensorflow': {'.': {}, 'f:storageUri': {}}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2021-05-11T21:34:31Z'}, {'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {}}, 'f:spec': {'f:predictor': {'f:containers': {}, 'f:tensorflow': {'f:args': {}, 'f:command': {}, 'f:image': {}}}}, 'f:status': {'.': {}, 'f:components': {'.': {}, 'f:predictor': {'.': {}, 'f:latestCreatedRevision': {}, 'f:latestReadyRevision': {}, 'f:latestRolledoutRevision': {}, 'f:traffic': {}}}, 'f:conditions': {}}}, 'manager': 'manager', 'operation': 'Update', 'time': '2021-05-11T21:35:05Z'}], 'name': 'mnist-e2e', 'namespace': 'profile', 'resourceVersion': '651911', 'selfLink': '/apis/serving.kubeflow.org/v1beta1/namespaces/profile/inferenceservices/mnist-e2e', 'uid': '69783a07-cb35-41ae-8809-6c2efa15b954'}, 'spec': {'predictor': {'tensorflow': {'name': 'kfserving-container', 'resources': {'limits': {'cpu': '1', 'memory': '2Gi'}, 'requests': {'cpu': '1', 'memory': '2Gi'}}, 'runtimeVersion': '1.14.0', 'storageUri': 'pvc://end-to-end-pipeline-ztsm4-model-volume/'}}}, 'status': {'components': {'predictor': {'latestCreatedRevision': 'mnist-e2e-predictor-default-00002', 'latestReadyRevision': 'mnist-e2e-predictor-default-00001', 'latestRolledoutRevision': 'mnist-e2e-predictor-default-00001', 'traffic': [{'latestRevision': True, 'percent': 100, 'revisionName': 'mnist-e2e-predictor-default-00001'}]}}, 'conditions': [{'lastTransitionTime': 
'2021-05-11T21:35:05Z', 'severity': 'Info', 'status': 'True', 'type': 'PredictorConfigurationReady'}, {'lastTransitionTime': '2021-05-11T21:35:05Z', 'status': 'Unknown', 'type': 'PredictorReady'}, {'lastTransitionTime': '2021-05-11T21:34:36Z', 'severity': 'Info', 'status': 'Unknown', 'type': 'PredictorRouteReady'}, {'lastTransitionTime': '2021-05-11T21:35:05Z', 'status': 'Unknown', 'type': 'Ready'}]}}
Model is timed out, please check the InferenceService events for more details.
InferenceService Event Message:
Warning InternalError 22m v1beta1Controllers fails to reconcile predictor: fails to update knative service: Operation cannot be fulfilled on services.serving.knative.dev "mnist-e2e-predictor-default": the object has been modified; please apply your changes to the latest version and try again
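This warning is Kubernetes optimistic concurrency at work: every object carries a resourceVersion, and an update computed from a stale read is rejected with exactly this "the object has been modified" message. Controllers usually recover by re-reading the object and reapplying the change (client-go ships this pattern as retry.RetryOnConflict). The self-contained sketch below only illustrates that conflict-and-retry pattern; it is not KFServing code, and the Store class and names are hypothetical:

```python
class Conflict(Exception):
    """Mimics the Kubernetes 409 'the object has been modified' error."""


class Store:
    """A toy API server: one object guarded by a resourceVersion counter."""

    def __init__(self, value):
        self.resource_version = 1
        self.value = value

    def get(self):
        return self.resource_version, self.value

    def update(self, read_version, new_value):
        # Optimistic concurrency: reject writes computed from a stale read.
        if read_version != self.resource_version:
            raise Conflict("the object has been modified; please apply your "
                           "changes to the latest version and try again")
        self.resource_version += 1
        self.value = new_value


def retry_on_conflict(store, mutate, max_tries=5):
    """Re-read the object and reapply the mutation until the write lands,
    the same shape as client-go's retry.RetryOnConflict."""
    for _ in range(max_tries):
        version, current = store.get()
        try:
            store.update(version, mutate(current))
            return
        except Conflict:
            continue
    raise Conflict(f"still conflicting after {max_tries} tries")


store = Store("revision-00001")

# Simulate the race: another writer bumps the object between our read
# and our write, exactly what the log above shows happening between the
# KFServing controller and another actor updating the knative service.
stale_version, _ = store.get()
store.update(stale_version, "touched-by-another-writer")

try:
    store.update(stale_version, "revision-00002")  # stale write is rejected
    raced = False
except Conflict:
    raced = True

# With retry, the write eventually succeeds against the fresh version.
retry_on_conflict(store, lambda _current: "revision-00002")
```

Because the conflict is transient by design, a single occurrence of this warning is usually harmless as long as a later reconcile succeeds; it only matters when the retries never converge.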
Thank you for validating this example @zijianjoy! Couple of questions:

- Does the InferenceService report any other errors, or only this warning?
- Does the InferenceService create a pod with the deployed model? If yes, can you check the logs please?

cc @yuzisun @pvaneck Do you have any ideas about this error?
Thank you @andreyvelich! I ran:

kubectl describe inferenceservice mnist-e2e-2 -n profile

and got the following:

Name: mnist-e2e-2
Namespace: profile
Labels: <none>
Annotations: sidecar.istio.io/inject: false
API Version: serving.kubeflow.org/v1beta1
Kind: InferenceService
Metadata:
Creation Timestamp: 2021-05-11T22:42:13Z
Finalizers:
inferenceservice.finalizers
Generation: 1
Managed Fields:
API Version: serving.kubeflow.org/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:sidecar.istio.io/inject:
f:spec:
.:
f:predictor:
.:
f:tensorflow:
.:
f:storageUri:
Manager: OpenAPI-Generator
Operation: Update
Time: 2021-05-11T22:42:10Z
API Version: serving.kubeflow.org/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
f:spec:
f:predictor:
f:containers:
f:tensorflow:
f:args:
f:command:
f:image:
f:status:
.:
f:components:
.:
f:predictor:
.:
f:latestCreatedRevision:
f:latestReadyRevision:
f:latestRolledoutRevision:
f:traffic:
f:conditions:
Manager: manager
Operation: Update
Time: 2021-05-11T22:42:33Z
Resource Version: 739007
Self Link: /apis/serving.kubeflow.org/v1beta1/namespaces/profile/inferenceservices/mnist-e2e-2
UID: 98f645c8-b006-44c5-9754-149b8eba22a9
Spec:
Predictor:
Tensorflow:
Name: kfserving-container
Resources:
Limits:
Cpu: 1
Memory: 2Gi
Requests:
Cpu: 1
Memory: 2Gi
Runtime Version: 1.14.0
Storage Uri: pvc://end-to-end-pipeline-ztsm4-model-volume/
Status:
Components:
Predictor:
Latest Created Revision: mnist-e2e-2-predictor-default-00002
Latest Ready Revision: mnist-e2e-2-predictor-default-00001
Latest Rolledout Revision: mnist-e2e-2-predictor-default-00001
Traffic:
Latest Revision: true
Percent: 100
Revision Name: mnist-e2e-2-predictor-default-00001
Conditions:
Last Transition Time: 2021-05-11T22:42:33Z
Severity: Info
Status: True
Type: PredictorConfigurationReady
Last Transition Time: 2021-05-11T22:52:18Z
Message: Revision "mnist-e2e-2-predictor-default-00002" failed with message: .
Reason: RevisionFailed
Status: False
Type: PredictorReady
Last Transition Time: 2021-05-11T22:52:18Z
Message: Revision "mnist-e2e-2-predictor-default-00002" failed with message: .
Reason: RevisionFailed
Severity: Info
Status: False
Type: PredictorRouteReady
Last Transition Time: 2021-05-11T22:52:18Z
Message: Revision "mnist-e2e-2-predictor-default-00002" failed with message: .
Reason: RevisionFailed
Status: False
Type: Ready
Events: <none>
I also ran:

kubectl logs kfserving-controller-manager-0 --container=manager

and got these messages:
{"level":"info","ts":1620768876.3706956,"logger":"KsvcReconciler","msg":"knative service configuration diff (-desired, +observed):","diff":" v1.ConfigurationSpec{\n \tTemplate: v1.RevisionTemplateSpec{\n \t\tObjectMeta: {Labels: {\"component\": \"predictor\", \"serving.kubeflow.org/inferenceservice\": \"mnist-e2e\"}, Annotations: {\"autoscaling.knative.dev/class\": \"kpa.autoscaling.knative.dev\", \"autoscaling.knative.dev/minScale\": \"1\", \"internal.serving.kubeflow.org/storage-initializer-sourceuri\": \"pvc://end-to-end-pipeline-ztsm4-model-volume/\", \"sidecar.istio.io/inject\": \"false\"}},\n \t\tSpec: v1.RevisionSpec{\n \t\t\tPodSpec: v1.PodSpec{\n \t\t\t\t... // 28 identical fields\n \t\t\t\tReadinessGates: nil,\n \t\t\t\tRuntimeClassName: nil,\n- \t\t\t\tEnableServiceLinks: nil,\n+ \t\t\t\tEnableServiceLinks: &false,\n \t\t\t\tPreemptionPolicy: nil,\n \t\t\t\tOverhead: nil,\n \t\t\t\t... // 2 identical fields\n \t\t\t},\n \t\t\tContainerConcurrency: &0,\n \t\t\tTimeoutSeconds: &300,\n \t\t},\n \t},\n }\n"} {"level":"info","ts":1620768876.3707314,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-predictor-default"} {"level":"info","ts":1620768876.4037502,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.4270444,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.4271548,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768876.4545128,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for 
inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.501067,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.5011787,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768876.5015552,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.5292032,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.5293138,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768876.5297685,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.6035028,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.603637,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} 
{"level":"info","ts":1620768876.6040368,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.613567,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.6136866,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768876.6139464,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768877.5144792,"logger":"CredentialBulder","msg":"found secret","SecretName":"default-token-n7cx5"} {"level":"info","ts":1620768878.1645737,"logger":"CredentialBulder","msg":"found secret","SecretName":"default-token-n7cx5"} {"level":"info","ts":1620768905.243309,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.2436843,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.2440674,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.2623904,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} 
{"level":"info","ts":1620768905.262604,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.2630544,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.395155,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.3952935,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.3956945,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.4056737,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.4058557,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.406223,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.685143,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference 
service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.685468,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.686008,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.7073374,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.707659,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.7082765,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.828258,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.828412,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.828831,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} 
{"level":"info","ts":1620768905.8389184,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.839035,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.839494,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620769478.190142,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620769478.1902683,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620769478.190749,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620769478.2048721,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620769478.2049851,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620769478.2053647,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling 
ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620772405.0318704,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","name":"","resources":{}}}} I0511 22:33:26.082805 1 request.go:645] Throttling request took 1.027992074s, request: GET:https://10.99.240.1:443/apis/acme.cert-manager.io/v1alpha2?timeout=32s {"level":"info","ts":1620772407.1123698,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e"} {"level":"info","ts":1620772490.816721,"logger":"v1beta1Controllers.InferenceService","msg":"Deleting external resources","InferenceService":"mnist-e2e"} {"level":"info","ts":1620772490.8220859,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} I0511 22:34:51.872799 1 request.go:645] Throttling request took 1.045716514s, request: GET:https://10.99.240.1:443/apis/kubeflow.org/v1beta1?timeout=32s {"level":"info","ts":1620772492.8906178,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate update","name":"mnist-e2e"} {"level":"info","ts":1620772492.8906608,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e"} {"level":"info","ts":1620772492.895843,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} 
{"level":"info","ts":1620772494.9074528,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate update","name":"mnist-e2e"} {"level":"info","ts":1620772494.9074967,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e"} {"level":"info","ts":1620772930.9721427,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","name":"","resources":{}}}} I0511 22:42:12.022887 1 request.go:645] Throttling request took 1.043349721s, request: GET:https://10.99.240.1:443/apis/install.istio.io/v1alpha1?timeout=32s {"level":"info","ts":1620772933.0588965,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e-2"} {"level":"info","ts":1620772933.1549237,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.2185495,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate update","name":"mnist-e2e-2"} {"level":"info","ts":1620772935.2185953,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e-2"} {"level":"info","ts":1620772935.2244377,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.2247844,"logger":"PredictorReconciler","msg":"Reconciling 
Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.2251766,"logger":"KsvcReconciler","msg":"Creating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.4839823,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.4934385,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.4936116,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.557141,"logger":"KsvcReconciler","msg":"knative service configuration diff (-desired, +observed):","diff":" v1.ConfigurationSpec{\n \tTemplate: v1.RevisionTemplateSpec{\n \t\tObjectMeta: {Labels: {\"component\": \"predictor\", \"serving.kubeflow.org/inferenceservice\": \"mnist-e2e-2\"}, Annotations: {\"autoscaling.knative.dev/class\": \"kpa.autoscaling.knative.dev\", \"autoscaling.knative.dev/minScale\": \"1\", \"internal.serving.kubeflow.org/storage-initializer-sourceuri\": \"pvc://end-to-end-pipeline-ztsm4-model-volume/\", \"sidecar.istio.io/inject\": \"false\"}},\n \t\tSpec: v1.RevisionSpec{\n \t\t\tPodSpec: v1.PodSpec{\n \t\t\t\t... 
// 28 identical fields\n \t\t\t\tReadinessGates: nil,\n \t\t\t\tRuntimeClassName: nil,\n- \t\t\t\tEnableServiceLinks: nil,\n+ \t\t\t\tEnableServiceLinks: &false,\n \t\t\t\tPreemptionPolicy: nil,\n \t\t\t\tOverhead: nil,\n \t\t\t\t... // 2 identical fields\n \t\t\t},\n \t\t\tContainerConcurrency: &0,\n \t\t\tTimeoutSeconds: &300,\n \t\t},\n \t},\n }\n"} {"level":"info","ts":1620772935.5572016,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.59776,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.6823056,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.7107418,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.7364764,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"error","ts":1620772935.7957175,"logger":"v1beta1Controllers.InferenceService","msg":"Failed to reconcile","reconciler":"<components.Predictor Value>","Name":"mnist-e2e-2","error":"fails to reconcile predictor: fails to update knative service: Operation cannot be fulfilled on services.serving.knative.dev \"mnist-e2e-2-predictor-default\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on services.serving.knative.dev \"mnist-e2e-2-predictor-default\": the object has been modified; please apply your changes to the latest version and try again\nfails to update knative 
service\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/reconcilers/knative.(KsvcReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/reconcilers/knative/ksvc_reconciler.go:184\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components.(Predictor).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components/predictor.go:117\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_
amd64.s:1373\nfails to reconcile predictor\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components.(Predictor).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components/predictor.go:119\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373","stacktrace":"github.com/go-logr/zapr.(zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(Inferen
ceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:133\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99"} {"level":"error","ts":1620772935.7959955,"logger":"controller-runtime.manager.controller.inferenceservice","msg":"Reconciler error","reconciler group":"serving.kubeflow.org","reconciler kind":"InferenceService","name":"mnist-e2e-2","namespace":"profile","error":"fails to reconcile component: fails to reconcile predictor: fails to update knative service: Operation cannot be fulfilled on services.serving.knative.dev \"mnist-e2e-2-predictor-default\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be 
fulfilled on services.serving.knative.dev \"mnist-e2e-2-predictor-default\": the object has been modified; please apply your changes to the latest version and try again\nfails to update knative service\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/reconcilers/knative.(KsvcReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/reconcilers/knative/ksvc_reconciler.go:184\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components.(Predictor).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components/predictor.go:117\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/p
kg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\nfails to reconcile predictor\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components.(Predictor).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components/predictor.go:119\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\
nfails to reconcile component\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:135\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373","stacktrace":"github.com/go-logr/zapr.(zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:267\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controll
er-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99"} {"level":"info","ts":1620772935.7962446,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.7964034,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.7974548,"logger":"KsvcReconciler","msg":"knative service configuration diff (-desired, +observed):","diff":" v1.ConfigurationSpec{\n \tTemplate: v1.RevisionTemplateSpec{\n \t\tObjectMeta: {Labels: {\"component\": \"predictor\", \"serving.kubeflow.org/inferenceservice\": \"mnist-e2e-2\"}, Annotations: {\"autoscaling.knative.dev/class\": \"kpa.autoscaling.knative.dev\", \"autoscaling.knative.dev/minScale\": \"1\", 
\"internal.serving.kubeflow.org/storage-initializer-sourceuri\": \"pvc://end-to-end-pipeline-ztsm4-model-volume/\", \"sidecar.istio.io/inject\": \"false\"}},\n \t\tSpec: v1.RevisionSpec{\n \t\t\tPodSpec: v1.PodSpec{\n \t\t\t\t... // 28 identical fields\n \t\t\t\tReadinessGates: nil,\n \t\t\t\tRuntimeClassName: nil,\n- \t\t\t\tEnableServiceLinks: nil,\n+ \t\t\t\tEnableServiceLinks: &false,\n \t\t\t\tPreemptionPolicy: nil,\n \t\t\t\tOverhead: nil,\n \t\t\t\t... // 2 identical fields\n \t\t\t},\n \t\t\tContainerConcurrency: &0,\n \t\t\tTimeoutSeconds: &300,\n \t\t},\n \t},\n }\n"} {"level":"info","ts":1620772935.7975194,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.8235893,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8498397,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8500657,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.8545716,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8548284,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8549552,"logger":"PredictorReconciler","msg":"Reconciling 
Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.8554227,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8848317,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8849595,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.8853812,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.913133,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.9132566,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.9136732,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.0020895,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} 
{"level":"info","ts":1620772936.002502,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772936.0030398,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.022172,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.022295,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772936.0227535,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.0678174,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.0679862,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772936.068509,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772937.9828658,"logger":"CredentialBulder","msg":"found secret","SecretName":"default-token-n7cx5"} 
{"level":"info","ts":1620772938.0545385,"logger":"CredentialBulder","msg":"found secret","SecretName":"default-token-n7cx5"} {"level":"info","ts":1620772952.766172,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772952.766301,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772952.7665982,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772952.8829322,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772952.8831072,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772952.8835628,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.0701418,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.0703616,"logger":"PredictorReconciler","msg":"Reconciling 
Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.070938,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.1341956,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.134363,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.1351116,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.566679,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.5668004,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.5673482,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.5886502,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} 
{"level":"info","ts":1620772953.5887856,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.5891395,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.6058204,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.6059582,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.6063447,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.6549683,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.6551404,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.6555724,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620773538.3125443,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference 
service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620773538.3127034,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620773538.3131828,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620773538.3351989,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620773538.3353457,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620773538.3357365,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620790652.2456176,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620790652.2460716,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620790652.246698,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} 
{"level":"info","ts":1620790652.2469358,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620790652.247036,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620790652.2472653,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620794383.3470883,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620794383.347398,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620794383.3480515,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620826109.6370797,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620826109.637784,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} 
{"level":"info","ts":1620826109.6397216,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620833555.0125809,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620833555.0127358,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620833555.013176,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620861567.025887,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620861567.0260932,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620861567.0266256,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620872726.678136,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620872726.6783519,"logger":"PredictorReconciler","msg":"Reconciling 
Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620872726.7392294,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} E0513 05:04:28.791187 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1alpha3.VirtualService: the server is currently unable to handle the request (get virtualservices.networking.istio.io) E0513 05:04:28.791179 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1beta1.InferenceService: the server is currently unable to handle the request (get inferenceservices.serving.kubeflow.org) E0513 05:04:28.791281 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1.Service: the server is currently unable to handle the request (get services.serving.knative.dev) E0513 05:04:28.853626 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1alpha1.TrainedModel: the server is currently unable to handle the request (get trainedmodels.serving.kubeflow.org) {"level":"info","ts":1620882272.3579726,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882272.3582094,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620882272.3588848,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress 
for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882276.8830557,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882276.883248,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620882276.883727,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} E0513 05:12:48.049634 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1.Service: the server is currently unable to handle the request (get services.serving.knative.dev) E0513 05:12:48.083670 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1alpha1.TrainedModel: the server is currently unable to handle the request (get trainedmodels.serving.kubeflow.org) E0513 05:12:48.123574 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1alpha3.VirtualService: the server is currently unable to handle the request (get virtualservices.networking.istio.io) E0513 05:12:48.124023 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1beta1.InferenceService: the server is currently unable to handle the request (get inferenceservices.serving.kubeflow.org) I0513 05:13:00.457722 1 trace.go:116] Trace[1790964291]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (started: 2021-05-13 05:12:49.373439004 +0000 UTC m=+127593.803534183) (total time: 11.084169617s): Trace[1790964291]: [11.084141032s] [11.084141032s] Objects listed I0513 
05:13:05.124312 1 trace.go:116] Trace[1375632262]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (started: 2021-05-13 05:12:48.98320931 +0000 UTC m=+127593.413304470) (total time: 16.141065844s): Trace[1375632262]: [16.141039304s] [16.141039304s] Objects listed {"level":"info","ts":1620882785.1247594,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882785.1249487,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620882785.126026,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} I0513 05:13:05.333913 1 trace.go:116] Trace[616970762]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (started: 2021-05-13 05:12:49.217382364 +0000 UTC m=+127593.647477528) (total time: 16.11649424s): Trace[616970762]: [16.116485398s] [16.116485398s] Objects listed I0513 05:13:10.785735 1 trace.go:116] Trace[275276684]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (started: 2021-05-13 05:12:49.633848559 +0000 UTC m=+127594.063943779) (total time: 21.151859663s): Trace[275276684]: [21.151806247s] [21.151806247s] Objects listed {"level":"info","ts":1620882790.785898,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882790.78602,"logger":"PredictorReconciler","msg":"Reconciling 
Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620882790.7864337,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620918242.515071,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620918242.5152462,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620918242.5157752,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620918242.5163105,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620918242.5164633,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620918242.5168517,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620921962.45178,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} 
{"level":"info","ts":1620921962.4519224,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620921962.452611,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620944126.435673,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","name":"","resources":{}}}} I0513 22:15:27.486792 1 request.go:645] Throttling request took 1.045021006s, request: GET:https://10.99.240.1:443/apis/acme.cert-manager.io/v1alpha2?timeout=32s {"level":"info","ts":1620944128.7775807,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e-2"}
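For context on the error at the top of the log: "the object has been modified; please apply your changes to the latest version and try again" is the standard Kubernetes optimistic-concurrency conflict. Every object carries a `resourceVersion`, and an update that presents a stale `resourceVersion` is rejected so the writer must re-read and retry. A toy sketch of that scheme (illustrative only, not KFServing or apiserver code):

```python
# Illustrative sketch (not KFServing code) of resourceVersion-based
# optimistic concurrency, the mechanism behind the conflict error above.

class Conflict(Exception):
    pass

class Store:
    """Toy stand-in for the apiserver's versioned object store."""

    def __init__(self):
        self.resource_version = 1
        self.spec = {}

    def get(self):
        # A read returns the current resourceVersion along with the object.
        return self.resource_version, dict(self.spec)

    def update(self, read_version, new_spec):
        # An update carrying a stale resourceVersion is rejected.
        if read_version != self.resource_version:
            raise Conflict("the object has been modified; please apply "
                           "your changes to the latest version and try again")
        self.spec = new_spec
        self.resource_version += 1

store = Store()
rv, _ = store.get()                      # controller reads the object
store.update(rv, {"minScale": "1"})      # another writer updates it first
try:
    store.update(rv, {"minScale": "2"})  # controller's write is now stale
except Conflict as exc:
    print(exc)
rv, _ = store.get()                      # recovery: re-read, then retry
store.update(rv, {"minScale": "2"})      # retry against the latest version succeeds
```

The repeated "Reconciling inference service" entries above are the controller requeueing after exactly this kind of rejected write.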
3. For the failed predictor (revision 02) I ran `kubectl logs mnist-e2e-2-predictor-default-00002-deployment-784bc4d56d-zznkn --container=storage-initializer -n profile` and got:
Error from server (BadRequest): container "storage-initializer" in pod "mnist-e2e-2-predictor-default-00002-deployment-784bc4d56d-zznkn" is waiting to start: PodInitializing
4. For the successful predictor (revision 01) I ran `kubectl logs mnist-e2e-2-predictor-default-00001-deployment-7875989b7d-hdh88 --container=kfserving-container -n profile` and got:
2021-05-11 22:42:30.106496: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config: model_name: mnist-e2e-2 model_base_path: /mnt/models
2021-05-11 22:42:30.106835: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2021-05-11 22:42:30.106864: I tensorflow_serving/model_servers/server_core.cc:561] (Re-)adding model: mnist-e2e-2
2021-05-11 22:42:30.209319: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: mnist-e2e-2 version: 1620772899}
2021-05-11 22:42:30.209400: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mnist-e2e-2 version: 1620772899}
2021-05-11 22:42:30.209422: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mnist-e2e-2 version: 1620772899}
2021-05-11 22:42:30.209455: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /mnt/models/1620772899
2021-05-11 22:42:30.209471: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /mnt/models/1620772899
2021-05-11 22:42:30.230562: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2021-05-11 22:42:30.231653: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-05-11 22:42:30.249943: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2021-05-11 22:42:30.459086: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: /mnt/models/1620772899
2021-05-11 22:42:30.461961: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 252482 microseconds.
2021-05-11 22:42:30.462051: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:103] No warmup data file found at /mnt/models/1620772899/assets.extra/tf_serving_warmup_requests
2021-05-11 22:42:30.462224: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: mnist-e2e-2 version: 1620772899}
2021-05-11 22:42:30.465202: I tensorflow_serving/model_servers/server.cc:324] Running gRPC ModelServer at 0.0.0.0:9000 ...
[warn] getaddrinfo: address family for nodename not supported
2021-05-11 22:42:30.467374: I tensorflow_serving/model_servers/server.cc:344] Exporting HTTP/REST API at:localhost:8080 ...
[evhttp_server.cc : 239] RAW: Entering the event loop ...
Thank you for providing this information @zijianjoy. Which version of the KFServing manager are you using? Is it v0.5.1?
It seems that the storage-initializer container can't start and is stuck in the PodInitializing state.
In addition, can you provide the output of these commands:
kubectl get revision mnist-e2e-2-predictor-default-00002 -n profile
kubectl describe revision mnist-e2e-2-predictor-default-00002 -n profile
kubectl describe pod mnist-e2e-2-predictor-default-00002-deployment-784bc4d56d-zznkn -n profile
Also, I tried to deploy the Kubeflow manifests on GCP with the Dex setup: https://github.com/kubeflow/manifests/blob/master/example/kustomization.yaml. That example works for me.
@yuzisun Any ideas why storage-initializer is stuck in the PodInitializing state?
Also cc @theofpa for the KFServing question.
Thank you @andreyvelich for helping to debug.
I confirm that KFServing manager version is v0.5.1.
Here are the requested outputs:
kubectl get revision mnist-e2e-predictor-default-00002 -n profile-ref10
NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON
mnist-e2e-predictor-default-00002 mnist-e2e-predictor-default mnist-e2e-predictor-default-00002 2 False PodInitializing
kubectl describe revision mnist-e2e-predictor-default-00002 -n profile-ref10
Name: mnist-e2e-predictor-default-00002
Namespace: profile-ref10
Labels: component=predictor
serving.knative.dev/configuration=mnist-e2e-predictor-default
serving.knative.dev/configurationGeneration=2
serving.knative.dev/configurationUID=5b98ca8d-e1eb-45b0-9555-3cabd9644e91
serving.knative.dev/routingState=active
serving.knative.dev/service=mnist-e2e-predictor-default
serving.knative.dev/serviceUID=7dc1aba0-6e58-4555-b015-acde92f53292
serving.kubeflow.org/inferenceservice=mnist-e2e
Annotations: autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/minScale: 1
internal.serving.kubeflow.org/storage-initializer-sourceuri: pvc://end-to-end-pipeline-6v2q4-model-volume/
serving.knative.dev/creator: system:serviceaccount:kubeflow:default
serving.knative.dev/routes: mnist-e2e-predictor-default
serving.knative.dev/routingStateModified: 2021-05-14T20:55:41Z
sidecar.istio.io/inject: false
API Version: serving.knative.dev/v1
Kind: Revision
Metadata:
Creation Timestamp: 2021-05-14T20:55:41Z
Generation: 1
Managed Fields:
API Version: serving.knative.dev/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:autoscaling.knative.dev/class:
f:autoscaling.knative.dev/minScale:
f:internal.serving.kubeflow.org/storage-initializer-sourceuri:
f:serving.knative.dev/creator:
f:serving.knative.dev/routes:
f:serving.knative.dev/routingStateModified:
f:sidecar.istio.io/inject:
f:labels:
.:
f:component:
f:serving.knative.dev/configuration:
f:serving.knative.dev/configurationGeneration:
f:serving.knative.dev/configurationUID:
f:serving.knative.dev/routingState:
f:serving.knative.dev/service:
f:serving.knative.dev/serviceUID:
f:serving.kubeflow.org/inferenceservice:
f:ownerReferences:
.:
k:{"uid":"5b98ca8d-e1eb-45b0-9555-3cabd9644e91"}:
.:
f:apiVersion:
f:blockOwnerDeletion:
f:controller:
f:kind:
f:name:
f:uid:
f:spec:
.:
f:containerConcurrency:
f:containers:
f:timeoutSeconds:
f:status:
.:
f:conditions:
f:containerStatuses:
f:imageDigest:
f:observedGeneration:
f:serviceName:
Manager: controller
Operation: Update
Time: 2021-05-14T20:55:42Z
Owner References:
API Version: serving.knative.dev/v1
Block Owner Deletion: true
Controller: true
Kind: Configuration
Name: mnist-e2e-predictor-default
UID: 5b98ca8d-e1eb-45b0-9555-3cabd9644e91
Resource Version: 166702
Self Link: /apis/serving.knative.dev/v1/namespaces/profile-ref10/revisions/mnist-e2e-predictor-default-00002
UID: 31815df6-59d4-454e-9654-6e9807b598fc
Spec:
Container Concurrency: 0
Containers:
Args:
--port=9000
--rest_api_port=8080
--model_name=mnist-e2e
--model_base_path=/mnt/models
--rest_api_timeout_in_ms=60000
Command:
/usr/bin/tensorflow_model_server
Image: tensorflow/serving:1.14.0
Name: kfserving-container
Readiness Probe:
Success Threshold: 1
Tcp Socket:
Port: 0
Resources:
Limits:
Cpu: 1
Memory: 2Gi
Requests:
Cpu: 1
Memory: 2Gi
Enable Service Links: false
Timeout Seconds: 300
Status:
Conditions:
Last Transition Time: 2021-05-14T20:55:43Z
Message: Requests to the target are being buffered as resources are provisioned.
Reason: Queued
Severity: Info
Status: Unknown
Type: Active
Last Transition Time: 2021-05-14T20:55:41Z
Status: Unknown
Type: ContainerHealthy
Last Transition Time: 2021-05-14T21:05:44Z
Reason: PodInitializing
Status: False
Type: Ready
Last Transition Time: 2021-05-14T21:05:44Z
Reason: PodInitializing
Status: False
Type: ResourcesAvailable
Container Statuses:
Image Digest: index.docker.io/tensorflow/serving@sha256:f7e59a29cbc17a6b507751cddde37bccad4407c05ebf2c13b8e6ccb7d2e9affb
Name: kfserving-container
Image Digest: index.docker.io/tensorflow/serving@sha256:f7e59a29cbc17a6b507751cddde37bccad4407c05ebf2c13b8e6ccb7d2e9affb
Observed Generation: 1
Service Name: mnist-e2e-predictor-default-00002
Events: <none>
kubectl describe pod mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5 -n profile-ref10
Name: mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5
Namespace: profile-ref10
Priority: 0
Node: gke-kf-ref10-default-pool-50caf419-s900/10.138.15.208
Start Time: Fri, 14 May 2021 20:55:44 +0000
Labels: app=mnist-e2e-predictor-default-00002
component=predictor
pod-template-hash=86b54db9f9
service.istio.io/canonical-name=mnist-e2e-predictor-default
service.istio.io/canonical-revision=mnist-e2e-predictor-default-00002
serving.knative.dev/configuration=mnist-e2e-predictor-default
serving.knative.dev/configurationGeneration=2
serving.knative.dev/configurationUID=5b98ca8d-e1eb-45b0-9555-3cabd9644e91
serving.knative.dev/revision=mnist-e2e-predictor-default-00002
serving.knative.dev/revisionUID=31815df6-59d4-454e-9654-6e9807b598fc
serving.knative.dev/service=mnist-e2e-predictor-default
serving.knative.dev/serviceUID=7dc1aba0-6e58-4555-b015-acde92f53292
serving.kubeflow.org/inferenceservice=mnist-e2e
Annotations: autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/minScale: 1
internal.serving.kubeflow.org/storage-initializer-sourceuri: pvc://end-to-end-pipeline-6v2q4-model-volume/
serving.knative.dev/creator: system:serviceaccount:kubeflow:default
sidecar.istio.io/inject: false
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/mnist-e2e-predictor-default-00002-deployment-86b54db9f9
Init Containers:
storage-initializer:
Container ID:
Image: gcr.io/kfserving/storage-initializer:v0.5.1
Image ID:
Port: <none>
Host Port: <none>
Args:
/mnt/pvc/
/mnt/models
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 100m
memory: 100Mi
Environment: <none>
Mounts:
/mnt/models from kfserving-provision-location (rw)
/mnt/pvc from kfserving-pvc-source (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-nbcbj (ro)
Containers:
kfserving-container:
Container ID:
Image: index.docker.io/tensorflow/serving@sha256:f7e59a29cbc17a6b507751cddde37bccad4407c05ebf2c13b8e6ccb7d2e9affb
Image ID:
Port: 8080/TCP
Host Port: 0/TCP
Command:
/usr/bin/tensorflow_model_server
Args:
--port=9000
--rest_api_port=8080
--model_name=mnist-e2e
--model_base_path=/mnt/models
--rest_api_timeout_in_ms=60000
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 2Gi
Requests:
cpu: 1
memory: 2Gi
Environment:
PORT: 8080
K_REVISION: mnist-e2e-predictor-default-00002
K_CONFIGURATION: mnist-e2e-predictor-default
K_SERVICE: mnist-e2e-predictor-default
Mounts:
/mnt/models from kfserving-provision-location (ro)
/mnt/pvc from kfserving-pvc-source (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-nbcbj (ro)
queue-proxy:
Container ID:
Image: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:6cd0c234bfbf88ac75df5243c2f9213dcc9def610414c506d418f9388187b771
Image ID:
Ports: 8022/TCP, 9090/TCP, 9091/TCP, 8012/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 25m
Readiness: http-get http://:8012/ delay=0s timeout=1s period=1s #success=1 #failure=3
Startup: exec [/ko-app/queue -probe-timeout 10m0s] delay=0s timeout=600s period=1s #success=1 #failure=1
Environment:
SERVING_NAMESPACE: profile-ref10
SERVING_SERVICE: mnist-e2e-predictor-default
SERVING_CONFIGURATION: mnist-e2e-predictor-default
SERVING_REVISION: mnist-e2e-predictor-default-00002
QUEUE_SERVING_PORT: 8012
CONTAINER_CONCURRENCY: 0
REVISION_TIMEOUT_SECONDS: 300
SERVING_POD: mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5 (v1:metadata.name)
SERVING_POD_IP: (v1:status.podIP)
SERVING_LOGGING_CONFIG:
SERVING_LOGGING_LEVEL:
SERVING_REQUEST_LOG_TEMPLATE: {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}
SERVING_ENABLE_REQUEST_LOG: false
SERVING_REQUEST_METRICS_BACKEND: prometheus
TRACING_CONFIG_BACKEND: none
TRACING_CONFIG_ZIPKIN_ENDPOINT:
TRACING_CONFIG_STACKDRIVER_PROJECT_ID:
TRACING_CONFIG_DEBUG: false
TRACING_CONFIG_SAMPLE_RATE: 0.1
USER_PORT: 8080
SYSTEM_NAMESPACE: knative-serving
METRICS_DOMAIN: knative.dev/internal/serving
SERVING_READINESS_PROBE: {"tcpSocket":{"port":8080,"host":"127.0.0.1"},"successThreshold":1}
ENABLE_PROFILING: false
SERVING_ENABLE_PROBE_REQUEST_LOG: false
METRICS_COLLECTOR_ADDRESS:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-nbcbj (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-nbcbj:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-nbcbj
Optional: false
kfserving-pvc-source:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: end-to-end-pipeline-6v2q4-model-volume
ReadOnly: false
kfserving-provision-location:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 6m41s (x1226 over 3d6h) kubelet Unable to attach or mount volumes: unmounted volumes=[kfserving-pvc-source], unattached volumes=[kfserving-pvc-source kfserving-provision-location default-token-nbcbj]: timed out waiting for the condition
Warning FailedMount 2m10s (x455 over 3d6h) kubelet Unable to attach or mount volumes: unmounted volumes=[kfserving-pvc-source], unattached volumes=[default-token-nbcbj kfserving-pvc-source kfserving-provision-location]: timed out waiting for the condition
Additional information:
1. Get Persistent Volumes of the cluster: `kubectl get pv -A`
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-3cd47ad0-76eb-44ca-9b01-30d1dbfaa0fa 10Gi RWO Delete Bound profile-ref10/workspace-notebook standard 3d6h
pvc-77985fef-a999-45ec-a2fb-2808a4df0675 1Gi RWO Delete Bound profile-ref10/end-to-end-pipeline-6v2q4-model-volume standard 3d6h
pvc-c582da8e-1d39-45a3-9913-9726313fdf77 10Gi RWO Delete Bound kubeflow/katib-mysql standard 3d8h
2. Get Persistent Volume Claims of the cluster: `kubectl get pvc -A`
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
kubeflow katib-mysql Bound pvc-c582da8e-1d39-45a3-9913-9726313fdf77 10Gi RWO standard 3d8h
profile-ref10 end-to-end-pipeline-6v2q4-model-volume Bound pvc-77985fef-a999-45ec-a2fb-2808a4df0675 1Gi RWO standard 3d6h
profile-ref10 workspace-notebook Bound pvc-3cd47ad0-76eb-44ca-9b01-30d1dbfaa0fa 10Gi RWO standard 3d6h
@zijianjoy Sounds like the model volume can't be attached to the KFServing container. Maybe the TFJob is holding it? On my GCP setup the volume was bound properly.
Try removing the delete_finished_tfjob=False parameter from the TFJob op:
op = tfjob_launcher_op(
    name=tfjob_name,
    namespace=tfjob_namespace,
    chief_spec=json.dumps(tfjob_chief_spec),
    worker_spec=json.dumps(tfjob_worker_spec),
    tfjob_timeout_minutes=60)
In the meantime, please describe the PVC to check which resources are using this volume:
kubectl describe pvc end-to-end-pipeline-6v2q4-model-volume -n profile-ref10
Try removing the delete_finished_tfjob=False parameter from the TFJob op:

If I remove the delete_finished_tfjob parameter, the pipeline fails at the Kubeflow - Launch TFJob step with: Exception: Timeout waiting for kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 to enter one of the conditions ['Succeeded', 'Failed'].
The logs are:
INFO:root:Generating tfjob template.
INFO:root:Creating kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10.
INFO:root:Created kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
...
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
...
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
Traceback (most recent call last):
  File "/ml/launch_tfjob.py", line 136, in <module>
    main()
  File "/ml/launch_tfjob.py", line 131, in main
    timeout=datetime.timedelta(minutes=args.tfjobTimeoutMinutes))
  File "/ml/launch_crd.py", line 75, in wait_for_condition
    "conditions {4}.".format(self.group, self.plural, name, namespace, expected_conditions))
Exception: Timeout waiting for kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 to enter one of the conditions ['Succeeded', 'Failed'].
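For context on the failure mode above: the launcher polls the TFJob's condition in a loop and only returns once the job reaches a terminal condition. A minimal sketch of that polling loop, assuming a simplified signature (the names here are illustrative, not the actual launch_crd.py API):

```python
import datetime
import time


def wait_for_condition(get_condition, expected, timeout, poll_seconds=2):
    """Poll get_condition() until it returns one of the expected terminal
    conditions, or raise once the timeout elapses. Simplified sketch of the
    loop in launch_crd.py's wait_for_condition."""
    deadline = datetime.datetime.now() + timeout
    while datetime.datetime.now() < deadline:
        condition = get_condition()
        if condition in expected:
            return condition
        time.sleep(poll_seconds)
    raise TimeoutError(
        "Timeout waiting for job to enter one of the conditions %s." % expected)


# In the failing run above the TFJob never leaves "Created", so the launcher
# polls for the full 60-minute timeout and then raises.
states = iter(["Created", "Created", "Succeeded"])
result = wait_for_condition(lambda: next(states),
                            ["Succeeded", "Failed"],
                            datetime.timedelta(seconds=5),
                            poll_seconds=0)
print(result)  # → Succeeded
```

A TFJob that stays in Created means the operator never scheduled its pods, which is consistent with the chief pod being blocked on the volume mount.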
In the meantime, please describe the PVC to check which resources are using this volume:
kubectl describe pvc end-to-end-pipeline-6v2q4-model-volume -n profile-ref10
Sharing the result for this PVC:
Name: end-to-end-pipeline-6v2q4-model-volume
Namespace: profile-ref10
StorageClass: standard
Status: Bound
Volume: pvc-77985fef-a999-45ec-a2fb-2808a4df0675
Labels: <none>
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By: mnist-e2e-2-chief-0
mnist-e2e-chief-0
mnist-e2e-predictor-default-00001-deployment-6fc58cc5b4-7d64k
mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5
Events: <none>
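The Used By list above is the key detail: this PVC has access mode RWO (ReadWriteOnce), so it can be mounted only by pods on a single node, yet four pods claim it. If any of them land on different nodes, the kubelet reports FailedMount / Multi-Attach timeouts like the events shown earlier. A minimal check for this situation (a hypothetical helper, not part of KFServing), using values copied from the describe output above:

```python
def has_rwo_attach_conflict(access_modes, used_by):
    """Return True when a ReadWriteOnce PVC is claimed by more than one pod,
    the precondition for Multi-Attach / FailedMount errors when those pods
    are scheduled onto different nodes."""
    return "RWO" in access_modes and len(used_by) > 1


# Values from `kubectl describe pvc end-to-end-pipeline-6v2q4-model-volume`.
users = [
    "mnist-e2e-2-chief-0",
    "mnist-e2e-chief-0",
    "mnist-e2e-predictor-default-00001-deployment-6fc58cc5b4-7d64k",
    "mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5",
]
print(has_rwo_attach_conflict(["RWO"], users))  # → True
```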
Sharing the compiled pipeline definition for this sample:
https://gist.github.com/zijianjoy/e96c8cc6533e21e48cf193746831b089
Try removing the delete_finished_tfjob=False parameter from the TFJob op:

If I remove the delete_finished_tfjob parameter, the pipeline fails at the Kubeflow - Launch TFJob step with: Exception: Timeout waiting for kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 to enter one of the conditions ['Succeeded', 'Failed'].
Can you double-check why the TFJob is stuck in the Created status? Describe the mnist-e2e-2 TFJob and check its pod logs.
List of pods in the kubeflow and user profile namespaces:
NAMESPACE NAME READY STATUS RESTARTS AGE
kubeflow admission-webhook-deployment-6748884cff-r6d76 1/1 Running 0 5d1h
kubeflow cache-deployer-deployment-7d787d9dd9-6bcj6 2/2 Running 1 5d1h
kubeflow cache-server-5948c9cd48-mp5s5 2/2 Running 1 5d1h
kubeflow centraldashboard-cc9574b95-dg6w6 1/1 Running 0 5d1h
kubeflow cloud-endpoints-controller-67c7b4b7f7-59nqk 1/1 Running 0 5d1h
kubeflow cloudsqlproxy-d7f4c7b86-bp84k 2/2 Running 1 5d1h
kubeflow jupyter-web-app-deployment-6f75d7796-fzqpg 1/1 Running 0 5d1h
kubeflow katib-controller-6d4d54cff7-gvvw7 1/1 Running 0 5d1h
kubeflow katib-db-manager-697d8ddd69-l7z7l 1/1 Running 0 5d1h
kubeflow katib-mysql-6fd88c9c89-nhggp 1/1 Running 0 5d1h
kubeflow katib-ui-74d89ffbfb-sd8zr 1/1 Running 0 5d1h
kubeflow kfserving-controller-manager-0 2/2 Running 0 5d1h
kubeflow kubeflow-pipelines-profile-controller-d87fff898-84ljp 1/1 Running 0 5d1h
kubeflow metacontroller-0 1/1 Running 0 5d1h
kubeflow metadata-envoy-deployment-7db956c6f5-dlcmd 1/1 Running 0 5d1h
kubeflow metadata-grpc-deployment-6d85777b7c-lxc8f 2/2 Running 2 5d1h
kubeflow metadata-writer-789b57984d-lrtc8 2/2 Running 7 5d1h
kubeflow minio-644b8b6bc5-qlf9p 2/2 Running 0 5d1h
kubeflow ml-pipeline-74769d8bf5-pw54l 2/2 Running 0 5d1h
kubeflow ml-pipeline-persistenceagent-db88647d5-h2vss 2/2 Running 0 2d2h
kubeflow ml-pipeline-scheduledworkflow-898c4b88f-4pnd5 2/2 Running 0 5d1h
kubeflow ml-pipeline-ui-8597f9fb9-2tbnm 2/2 Running 0 5d1h
kubeflow ml-pipeline-viewer-crd-67dbdc8d86-ffckf 2/2 Running 1 5d1h
kubeflow ml-pipeline-visualizationserver-57566d4c45-ktzvg 2/2 Running 0 5d1h
kubeflow notebook-controller-deployment-65d666b7b-nhs4v 1/1 Running 0 5d1h
kubeflow profiles-deployment-588c449459-tlhd2 2/2 Running 0 5d1h
kubeflow pytorch-operator-b89ff96cc-ztrrp 2/2 Running 2 5d1h
kubeflow tensorboard-controller-controller-manager-7b955d8f69-b8t4q 3/3 Running 14 5d1h
kubeflow tensorboards-web-app-deployment-64b569989d-jgc2n 1/1 Running 0 5d1h
kubeflow tf-job-operator-6fd4d8d869-72dng 1/1 Running 12 5d1h
kubeflow volumes-web-app-deployment-7df5bd657c-gcjdt 1/1 Running 0 5d1h
kubeflow workflow-controller-6855d48fd7-x5nf6 2/2 Running 2 5d1h
profile-ref10 end-to-end-pipeline-79ngw-1429492901 0/1 Completed 0 3h46m
profile-ref10 end-to-end-pipeline-79ngw-2227434625 0/2 Error 0 3h46m
profile-ref10 end-to-end-pipeline-79ngw-2393645640 0/1 Completed 0 3h46m
profile-ref10 end-to-end-pipeline-79ngw-3092938776 0/1 Completed 0 3h46m
profile-ref10 end-to-end-pipeline-9s5fb-1797256841 0/1 Completed 0 138m
profile-ref10 end-to-end-pipeline-9s5fb-2312731595 0/2 Completed 0 135m
profile-ref10 end-to-end-pipeline-9s5fb-2742465402 0/2 Error 0 135m
profile-ref10 end-to-end-pipeline-9s5fb-3438513990 0/2 Completed 0 138m
profile-ref10 end-to-end-pipeline-lk2q6-1163786406 0/2 Completed 0 3h40m
profile-ref10 end-to-end-pipeline-lk2q6-1403031006 0/1 Completed 0 3h43m
profile-ref10 end-to-end-pipeline-lk2q6-2126139547 0/2 Completed 0 3h43m
profile-ref10 end-to-end-pipeline-lk2q6-362872075 0/2 Error 0 3h39m
profile-ref10 gpu-smoke-check-l5ctv-833024161 0/2 Pending 0 5d1h
profile-ref10 ml-pipeline-ui-artifact-dd9b8f4bf-t6hfp 2/2 Running 0 5d1h
profile-ref10 ml-pipeline-visualizationserver-85b5c575db-w8r97 2/2 Running 0 5d1h
profile-ref10 mnist-e2e-2-chief-0 0/1 ContainerCreating 0 3h39m
profile-ref10 mnist-e2e-2-random-6b7d58d778-pl9rk 1/1 Running 0 3h43m
profile-ref10 mnist-e2e-2-worker-0 1/1 Running 0 3h39m
profile-ref10 mnist-e2e-chief-0 0/1 Completed 0 4d23h
profile-ref10 mnist-e2e-predictor-default-00001-deployment-6fc58cc5b4-7d64k 2/2 Running 0 4d23h
profile-ref10 mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5 0/2 Init:0/1 0 4d23h
profile-ref10 mnist-e2e-random-69f8755785-qcc7x 1/1 Running 0 4d23h
profile-ref10 mnist-e2e-ui-chief-0 0/1 ContainerCreating 0 135m
profile-ref10 mnist-e2e-ui-random-684c78955d-2885t 1/1 Running 0 138m
profile-ref10 mnist-e2e-ui-worker-0 1/1 Running 0 135m
profile-ref10 notebook-0 2/2 Running 0 4d23h
Describe the TFJob launcher pod: `kubectl describe pod end-to-end-pipeline-lk2q6-362872075 -n profile-ref10`
Name: end-to-end-pipeline-lk2q6-362872075
Namespace: profile-ref10
Priority: 0
Node: gke-kf-ref10-default-pool-50caf419-db2h/10.138.15.207
Start Time: Wed, 19 May 2021 16:46:17 +0000
Labels: pipelines.kubeflow.org/cache_enabled=true
pipelines.kubeflow.org/cache_id=
pipelines.kubeflow.org/metadata_context_id=71
pipelines.kubeflow.org/metadata_execution_id=89
pipelines.kubeflow.org/metadata_written=true
workflows.argoproj.io/completed=true
workflows.argoproj.io/workflow=end-to-end-pipeline-lk2q6
Annotations: pipelines.kubeflow.org/arguments.parameters:
{"ActiveDeadlineSeconds": "-1", "BackoffLimit": "-1", "Chief Spec": "{\"replicas\": 1, \"restartPolicy\": \"OnFailure\", \"template\": {\"...
pipelines.kubeflow.org/component_ref:
{"digest": "98fd1829c0a5ff1b36c53bc2a8d6482d33a7a5dc042e2ee661840be06065c325", "url": "https://raw.githubusercontent.com/kubeflow/pipeline...
pipelines.kubeflow.org/component_spec:
{"description": "Kubeflow TFJob launcher", "implementation": {"container": {"args": ["--name", {"inputValue": "Name"}, "--namespace", {"in...
pipelines.kubeflow.org/execution_cache_key: a2f1350d52463289d452d514614c5c768bba05ae87b40ac0cee77e07d28b663c
pipelines.kubeflow.org/metadata_input_artifact_ids: []
pipelines.kubeflow.org/metadata_output_artifact_ids:
[{"id": 160, "name": "main-logs", "uri": "minio://jamxl-kfp-dev-kfp/artifacts/end-to-end-pipeline-lk2q6/end-to-end-pipeline-lk2q6-36287207...
sidecar.istio.io/inject: false
workflows.argoproj.io/node-name: end-to-end-pipeline-lk2q6.kubeflow-launch-tfjob
workflows.argoproj.io/outputs:
{"artifacts":[{"name":"main-logs","archiveLogs":true,"s3":{"endpoint":"minio-service.kubeflow:9000","bucket":"jamxl-kfp-dev-kfp","insecure...
workflows.argoproj.io/template:
{"name":"kubeflow-launch-tfjob","arguments":{},"inputs":{"parameters":[{"name":"convert-katib-results-Output","value":"--tf-learning-rate=...
Status: Failed
IP: 10.108.0.127
IPs:
IP: 10.108.0.127
Controlled By: Workflow/end-to-end-pipeline-lk2q6
Containers:
wait:
Container ID: docker://7e06f6e7269576c0ea141c69301189a44919863e7a595b2b3c8f096fd4167d0d
Image: gcr.io/ml-pipeline/argoexec:v2.12.9-license-compliance
Image ID: docker-pullable://gcr.io/ml-pipeline/argoexec@sha256:9b512a260a215d65054685511dbd867184b0041721c51126c8ef478b4f32bd5e
Port: <none>
Host Port: <none>
Command:
argoexec
wait
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 19 May 2021 16:46:18 +0000
Finished: Wed, 19 May 2021 17:45:54 +0000
Ready: False
Restart Count: 0
Environment:
ARGO_POD_NAME: end-to-end-pipeline-lk2q6-362872075 (v1:metadata.name)
Mounts:
/argo/podmetadata from podmetadata (rw)
/argo/secret/mlpipeline-minio-artifact from mlpipeline-minio-artifact (ro)
/var/run/docker.sock from docker-sock (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-editor-token-9bv2r (ro)
main:
Container ID: docker://ff7e0163d8e7936a5140fb765480e7a33b99e3e4491102c9579daf6b8e685290
Image: nikenano/launchernew:latest
Image ID: docker-pullable://nikenano/launchernew@sha256:79c9599a50959541854856b4069529a7bc0fa312eebcec2ee0ab31d756b5eb9c
Port: <none>
Host Port: <none>
Command:
python
/ml/launch_tfjob.py
Args:
--name
mnist-e2e-2
--namespace
profile-ref10
--version
v1
--activeDeadlineSeconds
-1
--backoffLimit
-1
--cleanPodPolicy
Running
--ttlSecondsAfterFinished
-1
--psSpec
{}
--workerSpec
{"replicas": 1, "restartPolicy": "OnFailure", "template": {"metadata": {"annotations": {"sidecar.istio.io/inject": "false"}}, "spec": {"containers": [{"name": "tensorflow", "image": "docker.io/liuhougangxa/tf-estimator-mnist", "command": ["sh", "-c"], "args": ["python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88"]}]}}}
--chiefSpec
{"replicas": 1, "restartPolicy": "OnFailure", "template": {"metadata": {"annotations": {"sidecar.istio.io/inject": "false"}}, "spec": {"containers": [{"name": "tensorflow", "image": "docker.io/liuhougangxa/tf-estimator-mnist", "command": ["sh", "-c"], "args": ["python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88"], "volumeMounts": [{"mountPath": "/mnt/export", "name": "model-volume"}]}], "volumes": [{"name": "model-volume", "persistentVolumeClaim": {"claimName": "end-to-end-pipeline-6v2q4-model-volume"}}]}}}
--evaluatorSpec
{}
--tfjobTimeoutMinutes
60
--deleteAfterDone
True
State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 19 May 2021 16:46:19 +0000
Finished: Wed, 19 May 2021 17:45:53 +0000
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-editor-token-9bv2r (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
podmetadata:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.annotations -> annotations
docker-sock:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType: Socket
mlpipeline-minio-artifact:
Type: Secret (a volume populated by a Secret)
SecretName: mlpipeline-minio-artifact
Optional: false
default-editor-token-9bv2r:
Type: Secret (a volume populated by a Secret)
SecretName: default-editor-token-9bv2r
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Pod logs for the wait container: `kubectl logs end-to-end-pipeline-lk2q6-362872075 -n profile-ref10 --container=wait`
time="2021-05-19T16:46:18.387Z" level=info msg="Starting Workflow Executor" version=v2.12.9
time="2021-05-19T16:46:18.437Z" level=info msg="Creating a docker executor"
time="2021-05-19T16:46:18.438Z" level=info msg="Executor (version: v2.12.9, build_date: 2021-02-16T22:51:48Z) initialized (pod: profile-ref10/end-to-end-pipeline-lk2q6-362872075) with template:\n{\"name\":\"kubeflow-launch-tfjob\",\"arguments\":{},\"inputs\":{\"parameters\":[{\"name\":\"convert-katib-results-Output\",\"value\":\"--tf-learning-rate=0.04953975758073499 --tf-batch-size=88\"},{\"name\":\"model-volume-name\",\"value\":\"end-to-end-pipeline-6v2q4-model-volume\"},{\"name\":\"name\",\"value\":\"mnist-e2e-2\"},{\"name\":\"namespace\",\"value\":\"profile-ref10\"},{\"name\":\"training_steps\",\"value\":\"200\"}]},\"outputs\":{},\"metadata\":{\"annotations\":{\"pipelines.kubeflow.org/arguments.parameters\":\"{\\\"ActiveDeadlineSeconds\\\": \\\"-1\\\", \\\"BackoffLimit\\\": \\\"-1\\\", \\\"Chief Spec\\\": \\\"{\\\\\\\"replicas\\\\\\\": 1, \\\\\\\"restartPolicy\\\\\\\": \\\\\\\"OnFailure\\\\\\\", \\\\\\\"template\\\\\\\": {\\\\\\\"metadata\\\\\\\": {\\\\\\\"annotations\\\\\\\": {\\\\\\\"sidecar.istio.io/inject\\\\\\\": \\\\\\\"false\\\\\\\"}}, \\\\\\\"spec\\\\\\\": {\\\\\\\"containers\\\\\\\": [{\\\\\\\"name\\\\\\\": \\\\\\\"tensorflow\\\\\\\", \\\\\\\"image\\\\\\\": \\\\\\\"docker.io/liuhougangxa/tf-estimator-mnist\\\\\\\", \\\\\\\"command\\\\\\\": [\\\\\\\"sh\\\\\\\", \\\\\\\"-c\\\\\\\"], \\\\\\\"args\\\\\\\": [\\\\\\\"python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88\\\\\\\"], \\\\\\\"volumeMounts\\\\\\\": [{\\\\\\\"mountPath\\\\\\\": \\\\\\\"/mnt/export\\\\\\\", \\\\\\\"name\\\\\\\": \\\\\\\"model-volume\\\\\\\"}]}], \\\\\\\"volumes\\\\\\\": [{\\\\\\\"name\\\\\\\": \\\\\\\"model-volume\\\\\\\", \\\\\\\"persistentVolumeClaim\\\\\\\": {\\\\\\\"claimName\\\\\\\": \\\\\\\"end-to-end-pipeline-6v2q4-model-volume\\\\\\\"}}]}}}\\\", \\\"CleanPodPolicy\\\": \\\"Running\\\", \\\"Delete Finished Tfjob\\\": \\\"True\\\", \\\"Evaluator Spec\\\": \\\"{}\\\", \\\"Name\\\": \\\"mnist-e2e-2\\\", 
\\\"Namespace\\\": \\\"profile-ref10\\\", \\\"PS Spec\\\": \\\"{}\\\", \\\"Tfjob Timeout Minutes\\\": \\\"60\\\", \\\"Version\\\": \\\"v1\\\", \\\"Worker Spec\\\": \\\"{\\\\\\\"replicas\\\\\\\": 1, \\\\\\\"restartPolicy\\\\\\\": \\\\\\\"OnFailure\\\\\\\", \\\\\\\"template\\\\\\\": {\\\\\\\"metadata\\\\\\\": {\\\\\\\"annotations\\\\\\\": {\\\\\\\"sidecar.istio.io/inject\\\\\\\": \\\\\\\"false\\\\\\\"}}, \\\\\\\"spec\\\\\\\": {\\\\\\\"containers\\\\\\\": [{\\\\\\\"name\\\\\\\": \\\\\\\"tensorflow\\\\\\\", \\\\\\\"image\\\\\\\": \\\\\\\"docker.io/liuhougangxa/tf-estimator-mnist\\\\\\\", \\\\\\\"command\\\\\\\": [\\\\\\\"sh\\\\\\\", \\\\\\\"-c\\\\\\\"], \\\\\\\"args\\\\\\\": [\\\\\\\"python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88\\\\\\\"]}]}}}\\\", \\\"ttl Seconds After Finished\\\": \\\"-1\\\"}\",\"pipelines.kubeflow.org/component_ref\":\"{\\\"digest\\\": \\\"98fd1829c0a5ff1b36c53bc2a8d6482d33a7a5dc042e2ee661840be06065c325\\\", \\\"url\\\": \\\"https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/launcher/component.yaml\\\"}\",\"pipelines.kubeflow.org/component_spec\":\"{\\\"description\\\": \\\"Kubeflow TFJob launcher\\\", \\\"implementation\\\": {\\\"container\\\": {\\\"args\\\": [\\\"--name\\\", {\\\"inputValue\\\": \\\"Name\\\"}, \\\"--namespace\\\", {\\\"inputValue\\\": \\\"Namespace\\\"}, \\\"--version\\\", {\\\"inputValue\\\": \\\"Version\\\"}, \\\"--activeDeadlineSeconds\\\", {\\\"inputValue\\\": \\\"ActiveDeadlineSeconds\\\"}, \\\"--backoffLimit\\\", {\\\"inputValue\\\": \\\"BackoffLimit\\\"}, \\\"--cleanPodPolicy\\\", {\\\"inputValue\\\": \\\"CleanPodPolicy\\\"}, \\\"--ttlSecondsAfterFinished\\\", {\\\"inputValue\\\": \\\"ttl Seconds After Finished\\\"}, \\\"--psSpec\\\", {\\\"inputValue\\\": \\\"PS Spec\\\"}, \\\"--workerSpec\\\", {\\\"inputValue\\\": \\\"Worker Spec\\\"}, \\\"--chiefSpec\\\", {\\\"inputValue\\\": \\\"Chief Spec\\\"}, 
\\\"--evaluatorSpec\\\", {\\\"inputValue\\\": \\\"Evaluator Spec\\\"}, \\\"--tfjobTimeoutMinutes\\\", {\\\"inputValue\\\": \\\"Tfjob Timeout Minutes\\\"}, \\\"--deleteAfterDone\\\", {\\\"inputValue\\\": \\\"Delete Finished Tfjob\\\"}], \\\"command\\\": [\\\"python\\\", \\\"/ml/launch_tfjob.py\\\"], \\\"image\\\": \\\"nikenano/launchernew:latest\\\"}}, \\\"inputs\\\": [{\\\"description\\\": \\\"TFJob name.\\\", \\\"name\\\": \\\"Name\\\", \\\"type\\\": \\\"String\\\"}, {\\\"default\\\": \\\"kubeflow\\\", \\\"description\\\": \\\"TFJob namespace.\\\", \\\"name\\\": \\\"Namespace\\\", \\\"type\\\": \\\"String\\\"}, {\\\"default\\\": \\\"v1\\\", \\\"description\\\": \\\"TFJob version.\\\", \\\"name\\\": \\\"Version\\\", \\\"type\\\": \\\"String\\\"}, {\\\"default\\\": -1, \\\"description\\\": \\\"Specifies the duration (in seconds) since startTime during which the job can remain active before it is terminated. Must be a positive integer. This setting applies only to pods where restartPolicy is OnFailure or Always.\\\", \\\"name\\\": \\\"ActiveDeadlineSeconds\\\", \\\"type\\\": \\\"Integer\\\"}, {\\\"default\\\": -1, \\\"description\\\": \\\"Number of retries before marking this job as failed.\\\", \\\"name\\\": \\\"BackoffLimit\\\", \\\"type\\\": \\\"Integer\\\"}, {\\\"default\\\": -1, \\\"description\\\": \\\"Defines the TTL for cleaning up finished TFJobs.\\\", \\\"name\\\": \\\"ttl Seconds After Finished\\\", \\\"type\\\": \\\"Integer\\\"}, {\\\"default\\\": \\\"Running\\\", \\\"description\\\": \\\"Defines the policy for cleaning up pods after the TFJob completes.\\\", \\\"name\\\": \\\"CleanPodPolicy\\\", \\\"type\\\": \\\"String\\\"}, {\\\"default\\\": \\\"{}\\\", \\\"description\\\": \\\"TFJob ps replicaSpecs.\\\", \\\"name\\\": \\\"PS Spec\\\", \\\"type\\\": \\\"JsonObject\\\"}, {\\\"default\\\": \\\"{}\\\", \\\"description\\\": \\\"TFJob worker replicaSpecs.\\\", \\\"name\\\": \\\"Worker Spec\\\", \\\"type\\\": \\\"JsonObject\\\"}, {\\\"default\\\": 
\\\"{}\\\", \\\"description\\\": \\\"TFJob chief replicaSpecs.\\\", \\\"name\\\": \\\"Chief Spec\\\", \\\"type\\\": \\\"JsonObject\\\"}, {\\\"default\\\": \\\"{}\\\", \\\"description\\\": \\\"TFJob evaluator replicaSpecs.\\\", \\\"name\\\": \\\"Evaluator Spec\\\", \\\"type\\\": \\\"JsonObject\\\"}, {\\\"default\\\": 1440, \\\"description\\\": \\\"Time in minutes to wait for the TFJob to complete.\\\", \\\"name\\\": \\\"Tfjob Timeout Minutes\\\", \\\"type\\\": \\\"Integer\\\"}, {\\\"default\\\": \\\"True\\\", \\\"description\\\": \\\"Whether to delete the tfjob after it is finished.\\\", \\\"name\\\": \\\"Delete Finished Tfjob\\\", \\\"type\\\": \\\"Bool\\\"}], \\\"name\\\": \\\"Kubeflow - Launch TFJob\\\"}\",\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"nikenano/launchernew:latest\",\"command\":[\"python\",\"/ml/launch_tfjob.py\"],\"args\":[\"--name\",\"mnist-e2e-2\",\"--namespace\",\"profile-ref10\",\"--version\",\"v1\",\"--activeDeadlineSeconds\",\"-1\",\"--backoffLimit\",\"-1\",\"--cleanPodPolicy\",\"Running\",\"--ttlSecondsAfterFinished\",\"-1\",\"--psSpec\",\"{}\",\"--workerSpec\",\"{\\\"replicas\\\": 1, \\\"restartPolicy\\\": \\\"OnFailure\\\", \\\"template\\\": {\\\"metadata\\\": {\\\"annotations\\\": {\\\"sidecar.istio.io/inject\\\": \\\"false\\\"}}, \\\"spec\\\": {\\\"containers\\\": [{\\\"name\\\": \\\"tensorflow\\\", \\\"image\\\": \\\"docker.io/liuhougangxa/tf-estimator-mnist\\\", \\\"command\\\": [\\\"sh\\\", \\\"-c\\\"], \\\"args\\\": [\\\"python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88\\\"]}]}}}\",\"--chiefSpec\",\"{\\\"replicas\\\": 1, \\\"restartPolicy\\\": \\\"OnFailure\\\", \\\"template\\\": {\\\"metadata\\\": {\\\"annotations\\\": {\\\"sidecar.istio.io/inject\\\": \\\"false\\\"}}, \\\"spec\\\": {\\\"containers\\\": [{\\\"name\\\": \\\"tensorflow\\\", \\\"image\\\": 
\\\"docker.io/liuhougangxa/tf-estimator-mnist\\\", \\\"command\\\": [\\\"sh\\\", \\\"-c\\\"], \\\"args\\\": [\\\"python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88\\\"], \\\"volumeMounts\\\": [{\\\"mountPath\\\": \\\"/mnt/export\\\", \\\"name\\\": \\\"model-volume\\\"}]}], \\\"volumes\\\": [{\\\"name\\\": \\\"model-volume\\\", \\\"persistentVolumeClaim\\\": {\\\"claimName\\\": \\\"end-to-end-pipeline-6v2q4-model-volume\\\"}}]}}}\",\"--evaluatorSpec\",\"{}\",\"--tfjobTimeoutMinutes\",\"60\",\"--deleteAfterDone\",\"True\"],\"resources\":{}},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"jamxl-kfp-dev-kfp\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/end-to-end-pipeline-lk2q6/end-to-end-pipeline-lk2q6-362872075\"}}}"
time="2021-05-19T16:46:18.438Z" level=info msg="Waiting on main container"
time="2021-05-19T16:46:20.149Z" level=info msg="main container started with container ID: ff7e0163d8e7936a5140fb765480e7a33b99e3e4491102c9579daf6b8e685290"
time="2021-05-19T16:46:20.149Z" level=info msg="Starting annotations monitor"
time="2021-05-19T16:46:20.156Z" level=info msg="docker wait ff7e0163d8e7936a5140fb765480e7a33b99e3e4491102c9579daf6b8e685290"
time="2021-05-19T16:46:20.156Z" level=info msg="Starting deadline monitor"
time="2021-05-19T16:46:30.156Z" level=info msg="/argo/podmetadata/annotations updated"
time="2021-05-19T16:51:18.438Z" level=info msg="Alloc=5003 TotalAlloc=15663 Sys=70848 NumGC=7 Goroutines=12"
time="2021-05-19T16:56:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15675 Sys=70848 NumGC=9 Goroutines=12"
time="2021-05-19T17:01:18.438Z" level=info msg="Alloc=5003 TotalAlloc=15688 Sys=70848 NumGC=12 Goroutines=12"
time="2021-05-19T17:06:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15701 Sys=70848 NumGC=14 Goroutines=12"
time="2021-05-19T17:11:18.438Z" level=info msg="Alloc=5003 TotalAlloc=15713 Sys=70848 NumGC=17 Goroutines=12"
time="2021-05-19T17:16:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15726 Sys=70848 NumGC=19 Goroutines=12"
time="2021-05-19T17:21:18.438Z" level=info msg="Alloc=5004 TotalAlloc=15739 Sys=70848 NumGC=22 Goroutines=12"
time="2021-05-19T17:26:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15751 Sys=70848 NumGC=24 Goroutines=12"
time="2021-05-19T17:31:18.438Z" level=info msg="Alloc=5004 TotalAlloc=15763 Sys=70848 NumGC=27 Goroutines=12"
time="2021-05-19T17:36:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15776 Sys=70848 NumGC=29 Goroutines=12"
time="2021-05-19T17:41:18.438Z" level=info msg="Alloc=4855 TotalAlloc=15789 Sys=70848 NumGC=32 Goroutines=10"
time="2021-05-19T17:45:54.143Z" level=info msg="Main container completed"
time="2021-05-19T17:45:54.144Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-05-19T17:45:54.144Z" level=info msg="Capturing script exit code"
time="2021-05-19T17:45:54.144Z" level=info msg="Annotations monitor stopped"
time="2021-05-19T17:45:54.185Z" level=info msg="Saving logs"
time="2021-05-19T17:45:54.185Z" level=info msg="[docker logs ff7e0163d8e7936a5140fb765480e7a33b99e3e4491102c9579daf6b8e685290]"
time="2021-05-19T17:45:54.217Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/end-to-end-pipeline-lk2q6/end-to-end-pipeline-lk2q6-362872075/main.log"
time="2021-05-19T17:45:54.218Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2021-05-19T17:45:54.218Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: jamxl-kfp-dev-kfp, key: artifacts/end-to-end-pipeline-lk2q6/end-to-end-pipeline-lk2q6-362872075/main.log)"
time="2021-05-19T17:45:54.467Z" level=info msg="No output parameters"
time="2021-05-19T17:45:54.467Z" level=info msg="No output artifacts"
time="2021-05-19T17:45:54.467Z" level=info msg="Annotating pod with output"
time="2021-05-19T17:45:54.490Z" level=info msg="Killing sidecars"
time="2021-05-19T17:45:54.495Z" level=info msg="Alloc=6429 TotalAlloc=17373 Sys=70848 NumGC=34 Goroutines=10"
The main container ends with the same exception as before:
Exception: Timeout waiting for kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 to enter one of the conditions ['Succeeded', 'Failed'].
Could you also describe the TFJob pods, not a Kubeflow Pipelines task?
kubectl describe pod -n profile-ref10 mnist-e2e-ui-chief-0
And TFJob:
kubectl describe tfjob mnist-e2e -n profile-ref10
I see, thanks for clarification!
Sharing the TFJob pod: kubectl describe pod -n profile-ref10 mnist-e2e-ui-chief-0
Name: mnist-e2e-ui-chief-0
Namespace: profile-ref10
Priority: 0
Node: gke-kf-ref10-default-pool-50caf419-s900/10.138.15.208
Start Time: Wed, 19 May 2021 18:10:40 +0000
Labels: group-name=kubeflow.org
job-name=mnist-e2e-ui
job-role=master
replica-index=0
replica-type=chief
Annotations: sidecar.istio.io/inject: false
Status: Pending
IP:
IPs: <none>
Controlled By: TFJob/mnist-e2e-ui
Containers:
tensorflow:
Container ID:
Image: docker.io/liuhougangxa/tf-estimator-mnist
Image ID:
Port: 2222/TCP
Host Port: 0/TCP
Command:
sh
-c
Args:
python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.03830876327649775 --tf-batch-size=96
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment:
TF_CONFIG: {"cluster":{"chief":["mnist-e2e-ui-chief-0.profile-ref10.svc:2222"],"worker":["mnist-e2e-ui-worker-0.profile-ref10.svc:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}
Mounts:
/mnt/export from model-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-nbcbj (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
model-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: end-to-end-pipeline-6v2q4-model-volume
ReadOnly: false
default-token-nbcbj:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-nbcbj
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 6m4s (x58 over 171m) kubelet Unable to attach or mount volumes: unmounted volumes=[model-volume], unattached volumes=[model-volume default-token-nbcbj]: timed out waiting for the condition
Warning FailedMount 93s (x18 over 155m) kubelet Unable to attach or mount volumes: unmounted volumes=[model-volume], unattached volumes=[default-token-nbcbj model-volume]: timed out waiting for the condition
And TFJob: kubectl describe tfjob mnist-e2e-ui -n profile-ref10
Name: mnist-e2e-ui
Namespace: profile-ref10
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v1
Kind: TFJob
Metadata:
Creation Timestamp: 2021-05-19T18:10:40Z
Generation: 1
Managed Fields:
API Version: kubeflow.org/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
.:
f:tfReplicaSpecs:
.:
f:Chief:
.:
f:replicas:
f:restartPolicy:
f:template:
.:
f:metadata:
.:
f:annotations:
.:
f:sidecar.istio.io/inject:
f:spec:
.:
f:volumes:
f:Worker:
.:
f:replicas:
f:restartPolicy:
f:template:
.:
f:metadata:
.:
f:annotations:
.:
f:sidecar.istio.io/inject:
f:spec:
Manager: OpenAPI-Generator
Operation: Update
Time: 2021-05-19T18:10:40Z
API Version: kubeflow.org/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
f:runPolicy:
.:
f:cleanPodPolicy:
f:successPolicy:
f:tfReplicaSpecs:
f:Chief:
f:template:
f:metadata:
f:creationTimestamp:
f:spec:
f:containers:
f:Worker:
f:template:
f:metadata:
f:creationTimestamp:
f:spec:
f:containers:
f:status:
.:
f:conditions:
f:replicaStatuses:
.:
f:Chief:
f:Worker:
.:
f:active:
f:startTime:
Manager: tf-operator.v1
Operation: Update
Time: 2021-05-19T18:10:43Z
Resource Version: 8688633
Self Link: /apis/kubeflow.org/v1/namespaces/profile-ref10/tfjobs/mnist-e2e-ui
UID: 8be5d7c4-7012-49cc-8565-5327fafd67e4
Spec:
Clean Pod Policy: Running
Tf Replica Specs:
Chief:
Replicas: 1
Restart Policy: OnFailure
Template:
Metadata:
Annotations:
sidecar.istio.io/inject: false
Spec:
Containers:
Args:
python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.03830876327649775 --tf-batch-size=96
Command:
sh
-c
Image: docker.io/liuhougangxa/tf-estimator-mnist
Name: tensorflow
Volume Mounts:
Mount Path: /mnt/export
Name: model-volume
Volumes:
Name: model-volume
Persistent Volume Claim:
Claim Name: end-to-end-pipeline-6v2q4-model-volume
Worker:
Replicas: 1
Restart Policy: OnFailure
Template:
Metadata:
Annotations:
sidecar.istio.io/inject: false
Spec:
Containers:
Args:
python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.03830876327649775 --tf-batch-size=96
Command:
sh
-c
Image: docker.io/liuhougangxa/tf-estimator-mnist
Name: tensorflow
Status:
Conditions:
Last Transition Time: 2021-05-19T18:10:40Z
Last Update Time: 2021-05-19T18:10:40Z
Message: TFJob mnist-e2e-ui is created.
Reason: TFJobCreated
Status: True
Type: Created
Replica Statuses:
Chief:
Worker:
Active: 1
Start Time: 2021-05-19T18:10:40Z
Events: <none>
It seems the problem is with attaching the volume to your pods; it is taking a long time. Do you have any specific setup or constraints for your GCP cluster?
Also, check this: https://github.com/kubernetes/kubernetes/issues/67014#issuecomment-827629299.
Which version of GKE are you using? Can you try to upgrade to 1.19?
I am not aware of any specific constraint on my GCP cluster. I am using https://www.kubeflow.org/docs/distributions/gke/deploy/deploy-cli/ to deploy Kubeflow cluster.
I am not sure if upgrading Kubernetes version can help with this issue. Would you like to help me understand? My current Kubernetes version is 1.18.17-gke.100.
What is the Kubernetes version you are using which can finish volume mounting successfully?
I am using a GKE cluster with 4 nodes. The version is 1.19.9-gke.1900.
FYI, to clean up all resources from this example, run:
kubectl delete wf -n profile-ref10 --all
kubectl delete tfjob -n profile-ref10 --all
kubectl delete experiment -n profile-ref10 --all
kubectl delete isvc -n profile-ref10 --all
Thank you @andreyvelich, those are useful commands to clean up past resources; maybe we can add them to the E2E example documentation.
I would like to share an update from my side:
I had a Kubernetes cluster at version 1.18.17-gke.100, machine type e2-standard-8, and 2 nodes in the default node pool with autoscaling disabled. The image type was Container-Optimized OS with Docker (cos).
Based on the recommendation, I upgraded the cluster master version and node pool node version to 1.19.9-gke.1400. Note that in this case the image type doesn't change.
Instead of calling KFP to create a run directly, I used the following code to compile the pipeline to a YAML file, then uploaded it to KFP from the UI (I set delete_finished_tfjob=False):
import kfp
from kfp import compiler

compiler.Compiler(mode=kfp.dsl.PipelineExecutionMode.V2_COMPATIBLE).compile(
    mnist_pipeline, 'mnist_pipeline_not_delete_tfjob.yaml')
With this approach, I was able to finish the pipeline run and create the inference service successfully!
However, I am not able to establish a connection to the predict endpoint in the final step of the E2E example. The error message is:
ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'X-Goog-IAP-Generated-Response': 'true', 'Content-Length': '45', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Fri, 21 May 2021 17:17:30 GMT', 'Alt-Svc': 'clear'})
HTTP response body: OpenID Connect token expired: JWT has expired
I guess it is about the IAP policy requirement for sending requests to the GCP endpoint.
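Since the response body says "JWT has expired", one way to confirm this locally is to decode the cached OIDC token's `exp` claim before retrying the request. The helper below is a hypothetical sketch (not part of the KFP SDK): it performs no signature verification and is only meant to decide whether a fresh token must be fetched.

```python
import base64
import json
import time


def jwt_expired(token: str, skew_seconds: int = 60) -> bool:
    """Best-effort check whether a JWT's 'exp' claim is in the past.

    No signature verification is done; this only inspects the payload,
    which is enough to decide that a new token must be fetched before
    calling the IAP-protected predict endpoint again.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"] <= time.time() + skew_seconds
```

If this returns True for the token the SDK cached, re-running the OAuth flow (or deleting the cached credentials file so the client re-authenticates) should clear the 401.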
Note that Kubeflow Pipelines actually doesn't work on Kubernetes 1.19: https://github.com/kubeflow/pipelines/issues/5714. The reason the pipeline can run in the Action section above is that the image type doesn't change during a node version upgrade. If we create a Kubernetes cluster starting at 1.19, the image type will be Container-Optimized OS with Containerd, and KFP will not work in that case. We might need to pin the image type in the GCP distribution if we upgrade k8s to 1.19 for this E2E example. cc @Bobgy
Another caveat: calling KFP to create a run directly from a notebook still fails with the same message as described at the beginning of this thread (https://github.com/kubeflow/gcp-blueprints/issues/271#issuecomment-839200055), even after I upgraded to k8s 1.19. Below is the code I use to initialize the run:
# Run the Kubeflow Pipeline in the user's namespace.
kfp_client = kfp.Client(
    host='https://kf-ref9.endpoints.jamxl-kfp-dev.cloud.goog/pipeline',
    client_id='xxxxx.apps.googleusercontent.com',
    other_client_id='xxx.apps.googleusercontent.com',
    other_client_secret='xxx')
run_id = kfp_client.create_run_from_pipeline_func(
    mnist_pipeline, namespace=namespace, arguments={}).run_id
print("Run ID: ", run_id)
What is the difference which caused the pipeline run failure when uploading from UI vs in-cluster notebook?
Thank you for providing this information @zijianjoy.
The reason that pipeline can run in the Action section above is because image type doesn't change by node version upgrade.
Yes, for my GCP cluster I am also using the Container-Optimized OS with Docker (cos) image type, since I upgraded GKE from an older version.
However, I am not able to build connection to predict endpoint as the final step of E2E example. Error message is
I believe that if you want to call KFServing inference from outside the K8s cluster (not from Kubeflow Notebooks), you should use a different URL. @yuzisun Do we have an example of how to call KFServing inference deployed on GCP?
What is the difference which caused the pipeline run failure when uploading from UI vs in-cluster notebook?
There should not be any differences. Did you give your Kubeflow Notebook access to KFP (see this comment: https://github.com/kubeflow/pipelines/issues/5138#issuecomment-797003710) and extend the KFP SDK with the token credentials changes from this PR: https://github.com/kubeflow/pipelines/pull/5287? Basically, to run Pipelines in multi-user mode.
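For reference, with the multi-user setup from that PR the SDK authenticates with a projected ServiceAccount token instead of IAP/OAuth arguments. A minimal sketch of the lookup it performs is below; the env var name and default path are my understanding of that change, so treat them as assumptions:

```python
import os

# Assumed names from the multi-user token-credentials setup: the
# PodDefault mounts a projected ServiceAccount token into the notebook
# pod and exports its location via this environment variable.
DEFAULT_TOKEN_PATH = "/var/run/secrets/kubeflow/pipelines/token"


def sa_token_path() -> str:
    """Return where an in-cluster KFP client would look for the token."""
    return os.environ.get("KF_PIPELINES_SA_TOKEN_PATH", DEFAULT_TOKEN_PATH)
```

If that token file is mounted in the notebook pod, `kfp.Client()` should not need the `client_id`/`other_client_id`/`other_client_secret` arguments at all, which may explain the behavioral difference between UI and in-cluster runs.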
Do we have any differences in the volume creation step (dsl.VolumeOp()) when the pipeline is executed from a Notebook vs. from the UI?
https://github.com/kubeflow/pipelines/pull/5433