GoogleCloudPlatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos
Apache License 2.0
75 stars 63 forks source link

Validate Kubeflow GCP deployment using E2E example in https://github.com/kubeflow/pipelines/pull/5433 #271

Open zijianjoy opened 3 years ago

zijianjoy commented 3 years ago

https://github.com/kubeflow/pipelines/pull/5433

zijianjoy commented 3 years ago

KFServing step has failed

servingfailed

Error message:

{'apiVersion': 'serving.kubeflow.org/v1beta1', 'kind': 'InferenceService', 'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'}, 'creationTimestamp': '2021-05-11T21:34:33Z', 'finalizers': ['inferenceservice.finalizers'], 'generation': 1, 'managedFields': [{'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:sidecar.istio.io/inject': {}}}, 'f:spec': {'.': {}, 'f:predictor': {'.': {}, 'f:tensorflow': {'.': {}, 'f:storageUri': {}}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2021-05-11T21:34:31Z'}, {'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {}}, 'f:spec': {'f:predictor': {'f:containers': {}, 'f:tensorflow': {'f:args': {}, 'f:command': {}, 'f:image': {}}}}, 'f:status': {'.': {}, 'f:components': {'.': {}, 'f:predictor': {'.': {}, 'f:latestCreatedRevision': {}, 'f:latestReadyRevision': {}, 'f:latestRolledoutRevision': {}, 'f:traffic': {}}}, 'f:conditions': {}}}, 'manager': 'manager', 'operation': 'Update', 'time': '2021-05-11T21:35:05Z'}], 'name': 'mnist-e2e', 'namespace': 'profile', 'resourceVersion': '651911', 'selfLink': '/apis/serving.kubeflow.org/v1beta1/namespaces/profile/inferenceservices/mnist-e2e', 'uid': '69783a07-cb35-41ae-8809-6c2efa15b954'}, 'spec': {'predictor': {'tensorflow': {'name': 'kfserving-container', 'resources': {'limits': {'cpu': '1', 'memory': '2Gi'}, 'requests': {'cpu': '1', 'memory': '2Gi'}}, 'runtimeVersion': '1.14.0', 'storageUri': 'pvc://end-to-end-pipeline-ztsm4-model-volume/'}}}, 'status': {'components': {'predictor': {'latestCreatedRevision': 'mnist-e2e-predictor-default-00002', 'latestReadyRevision': 'mnist-e2e-predictor-default-00001', 'latestRolledoutRevision': 'mnist-e2e-predictor-default-00001', 'traffic': [{'latestRevision': True, 'percent': 100, 'revisionName': 'mnist-e2e-predictor-default-00001'}]}}, 'conditions': [{'lastTransitionTime': '2021-05-11T21:35:05Z', 'severity': 'Info', 'status': 'True', 'type': 'PredictorConfigurationReady'}, {'lastTransitionTime': '2021-05-11T21:35:05Z', 'status': 'Unknown', 'type': 'PredictorReady'}, {'lastTransitionTime': '2021-05-11T21:34:36Z', 'severity': 'Info', 'status': 'Unknown', 'type': 'PredictorRouteReady'}, {'lastTransitionTime': '2021-05-11T21:35:05Z', 'status': 'Unknown', 'type': 'Ready'}]}} Model is timed out, please check the InferenceService events for more details.

zijianjoy commented 3 years ago

InferenceService Event Message:

Warning InternalError 22m v1beta1Controllers fails to reconcile predictor: fails to update knative service: Operation cannot be fulfilled on services.serving.knative.dev "mnist-e2e-predictor-default": the object has been modified; please apply your changes to the latest version and try again

andreyvelich commented 3 years ago

Thank you for validating this example @zijianjoy! Couple of questions:

  1. Do you see any other messages from InferenceService or only this warning ?
  2. Can you check the logs from the KFServing controller ?
  3. Does the InferenceService create pod with the deployed model ? If yes, can you check the logs please.

cc @yuzisun @pvaneck Do you have any ideas for this error ?

zijianjoy commented 3 years ago

Thank you @andreyvelich !

  1. I didn't see other error than that message: I ran kubectl describe inferenceservice mnist-e2e-2 -n profile and get the following:
Name:         mnist-e2e-2
Namespace:    profile
Labels:       <none>
Annotations:  sidecar.istio.io/inject: false
API Version:  serving.kubeflow.org/v1beta1
Kind:         InferenceService
Metadata:
  Creation Timestamp:  2021-05-11T22:42:13Z
  Finalizers:
    inferenceservice.finalizers
  Generation:  1
  Managed Fields:
    API Version:  serving.kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:sidecar.istio.io/inject:
      f:spec:
        .:
        f:predictor:
          .:
          f:tensorflow:
            .:
            f:storageUri:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2021-05-11T22:42:10Z
    API Version:  serving.kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
      f:spec:
        f:predictor:
          f:containers:
          f:tensorflow:
            f:args:
            f:command:
            f:image:
      f:status:
        .:
        f:components:
          .:
          f:predictor:
            .:
            f:latestCreatedRevision:
            f:latestReadyRevision:
            f:latestRolledoutRevision:
            f:traffic:
        f:conditions:
    Manager:         manager
    Operation:       Update
    Time:            2021-05-11T22:42:33Z
  Resource Version:  739007
  Self Link:         /apis/serving.kubeflow.org/v1beta1/namespaces/profile/inferenceservices/mnist-e2e-2
  UID:               98f645c8-b006-44c5-9754-149b8eba22a9
Spec:
  Predictor:
    Tensorflow:
      Name:  kfserving-container
      Resources:
        Limits:
          Cpu:     1
          Memory:  2Gi
        Requests:
          Cpu:          1
          Memory:       2Gi
      Runtime Version:  1.14.0
      Storage Uri:      pvc://end-to-end-pipeline-ztsm4-model-volume/
Status:
  Components:
    Predictor:
      Latest Created Revision:    mnist-e2e-2-predictor-default-00002
      Latest Ready Revision:      mnist-e2e-2-predictor-default-00001
      Latest Rolledout Revision:  mnist-e2e-2-predictor-default-00001
      Traffic:
        Latest Revision:  true
        Percent:          100
        Revision Name:    mnist-e2e-2-predictor-default-00001
  Conditions:
    Last Transition Time:  2021-05-11T22:42:33Z
    Severity:              Info
    Status:                True
    Type:                  PredictorConfigurationReady
    Last Transition Time:  2021-05-11T22:52:18Z
    Message:               Revision "mnist-e2e-2-predictor-default-00002" failed with message: .
    Reason:                RevisionFailed
    Status:                False
    Type:                  PredictorReady
    Last Transition Time:  2021-05-11T22:52:18Z
    Message:               Revision "mnist-e2e-2-predictor-default-00002" failed with message: .
    Reason:                RevisionFailed
    Severity:              Info
    Status:                False
    Type:                  PredictorRouteReady
    Last Transition Time:  2021-05-11T22:52:18Z
    Message:               Revision "mnist-e2e-2-predictor-default-00002" failed with message: .
    Reason:                RevisionFailed
    Status:                False
    Type:                  Ready
Events:                    <none>
  1. I ran kubectl logs kfserving-controller-manager-0 --container=manager and get message:

{"level":"info","ts":1620768876.3706956,"logger":"KsvcReconciler","msg":"knative service configuration diff (-desired, +observed):","diff":" v1.ConfigurationSpec{\n \tTemplate: v1.RevisionTemplateSpec{\n \t\tObjectMeta: {Labels: {\"component\": \"predictor\", \"serving.kubeflow.org/inferenceservice\": \"mnist-e2e\"}, Annotations: {\"autoscaling.knative.dev/class\": \"kpa.autoscaling.knative.dev\", \"autoscaling.knative.dev/minScale\": \"1\", \"internal.serving.kubeflow.org/storage-initializer-sourceuri\": \"pvc://end-to-end-pipeline-ztsm4-model-volume/\", \"sidecar.istio.io/inject\": \"false\"}},\n \t\tSpec: v1.RevisionSpec{\n \t\t\tPodSpec: v1.PodSpec{\n \t\t\t\t... // 28 identical fields\n \t\t\t\tReadinessGates: nil,\n \t\t\t\tRuntimeClassName: nil,\n- \t\t\t\tEnableServiceLinks: nil,\n+ \t\t\t\tEnableServiceLinks: &false,\n \t\t\t\tPreemptionPolicy: nil,\n \t\t\t\tOverhead: nil,\n \t\t\t\t... // 2 identical fields\n \t\t\t},\n \t\t\tContainerConcurrency: &0,\n \t\t\tTimeoutSeconds: &300,\n \t\t},\n \t},\n }\n"} {"level":"info","ts":1620768876.3707314,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-predictor-default"} {"level":"info","ts":1620768876.4037502,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.4270444,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.4271548,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768876.4545128,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.501067,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.5011787,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768876.5015552,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.5292032,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.5293138,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768876.5297685,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.6035028,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.603637,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768876.6040368,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.613567,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768876.6136866,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768876.6139464,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768877.5144792,"logger":"CredentialBulder","msg":"found secret","SecretName":"default-token-n7cx5"} {"level":"info","ts":1620768878.1645737,"logger":"CredentialBulder","msg":"found secret","SecretName":"default-token-n7cx5"} {"level":"info","ts":1620768905.243309,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.2436843,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.2440674,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.2623904,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.262604,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.2630544,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.395155,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.3952935,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.3956945,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.4056737,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.4058557,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.406223,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.685143,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.685468,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.686008,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.7073374,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.707659,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.7082765,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.828258,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.828412,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.828831,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.8389184,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620768905.839035,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620768905.839494,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620769478.190142,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620769478.1902683,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620769478.190749,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620769478.2048721,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e"} {"level":"info","ts":1620769478.2049851,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620769478.2053647,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e"} {"level":"info","ts":1620772405.0318704,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","name":"","resources":{}}}} I0511 22:33:26.082805 1 request.go:645] Throttling request took 1.027992074s, request: GET:https://10.99.240.1:443/apis/acme.cert-manager.io/v1alpha2?timeout=32s {"level":"info","ts":1620772407.1123698,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e"} {"level":"info","ts":1620772490.816721,"logger":"v1beta1Controllers.InferenceService","msg":"Deleting external resources","InferenceService":"mnist-e2e"} {"level":"info","ts":1620772490.8220859,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} I0511 22:34:51.872799 1 request.go:645] Throttling request took 1.045716514s, request: GET:https://10.99.240.1:443/apis/kubeflow.org/v1beta1?timeout=32s {"level":"info","ts":1620772492.8906178,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate update","name":"mnist-e2e"} {"level":"info","ts":1620772492.8906608,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e"} {"level":"info","ts":1620772492.895843,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772494.9074528,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate update","name":"mnist-e2e"} {"level":"info","ts":1620772494.9074967,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e"} {"level":"info","ts":1620772930.9721427,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","name":"","resources":{}}}} I0511 22:42:12.022887 1 request.go:645] Throttling request took 1.043349721s, request: GET:https://10.99.240.1:443/apis/install.istio.io/v1alpha1?timeout=32s {"level":"info","ts":1620772933.0588965,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e-2"} {"level":"info","ts":1620772933.1549237,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.2185495,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate update","name":"mnist-e2e-2"} {"level":"info","ts":1620772935.2185953,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e-2"} {"level":"info","ts":1620772935.2244377,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.2247844,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.2251766,"logger":"KsvcReconciler","msg":"Creating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.4839823,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.4934385,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.4936116,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.557141,"logger":"KsvcReconciler","msg":"knative service configuration diff (-desired, +observed):","diff":" v1.ConfigurationSpec{\n \tTemplate: v1.RevisionTemplateSpec{\n \t\tObjectMeta: {Labels: {\"component\": \"predictor\", \"serving.kubeflow.org/inferenceservice\": \"mnist-e2e-2\"}, Annotations: {\"autoscaling.knative.dev/class\": \"kpa.autoscaling.knative.dev\", \"autoscaling.knative.dev/minScale\": \"1\", \"internal.serving.kubeflow.org/storage-initializer-sourceuri\": \"pvc://end-to-end-pipeline-ztsm4-model-volume/\", \"sidecar.istio.io/inject\": \"false\"}},\n \t\tSpec: v1.RevisionSpec{\n \t\t\tPodSpec: v1.PodSpec{\n \t\t\t\t... // 28 identical fields\n \t\t\t\tReadinessGates: nil,\n \t\t\t\tRuntimeClassName: nil,\n- \t\t\t\tEnableServiceLinks: nil,\n+ \t\t\t\tEnableServiceLinks: &false,\n \t\t\t\tPreemptionPolicy: nil,\n \t\t\t\tOverhead: nil,\n \t\t\t\t... // 2 identical fields\n \t\t\t},\n \t\t\tContainerConcurrency: &0,\n \t\t\tTimeoutSeconds: &300,\n \t\t},\n \t},\n }\n"} {"level":"info","ts":1620772935.5572016,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.59776,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.6823056,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.7107418,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.7364764,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"error","ts":1620772935.7957175,"logger":"v1beta1Controllers.InferenceService","msg":"Failed to reconcile","reconciler":"<components.Predictor Value>","Name":"mnist-e2e-2","error":"fails to reconcile predictor: fails to update knative service: Operation cannot be fulfilled on services.serving.knative.dev \"mnist-e2e-2-predictor-default\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on services.serving.knative.dev \"mnist-e2e-2-predictor-default\": the object has been modified; please apply your changes to the latest version and try again\nfails to update knative service\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/reconcilers/knative.(KsvcReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/reconcilers/knative/ksvc_reconciler.go:184\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components.(Predictor).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components/predictor.go:117\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\nfails to reconcile predictor\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components.(Predictor).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components/predictor.go:119\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373","stacktrace":"github.com/go-logr/zapr.(zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:133\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99"} {"level":"error","ts":1620772935.7959955,"logger":"controller-runtime.manager.controller.inferenceservice","msg":"Reconciler error","reconciler group":"serving.kubeflow.org","reconciler kind":"InferenceService","name":"mnist-e2e-2","namespace":"profile","error":"fails to reconcile component: fails to reconcile predictor: fails to update knative service: Operation cannot be fulfilled on services.serving.knative.dev \"mnist-e2e-2-predictor-default\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on services.serving.knative.dev \"mnist-e2e-2-predictor-default\": the object has been modified; please apply your changes to the latest version and try again\nfails to update knative service\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/reconcilers/knative.(KsvcReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/reconcilers/knative/ksvc_reconciler.go:184\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components.(Predictor).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components/predictor.go:117\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\nfails to reconcile predictor\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components.(Predictor).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/components/predictor.go:119\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\nfails to reconcile component\ngithub.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice.(InferenceServiceReconciler).Reconcile\n\t/go/src/github.com/kubeflow/kfserving/pkg/controller/v1beta1/inferenceservice/controller.go:135\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373","stacktrace":"github.com/go-logr/zapr.(zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:267\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99"} {"level":"info","ts":1620772935.7962446,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.7964034,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.7974548,"logger":"KsvcReconciler","msg":"knative service configuration diff (-desired, +observed):","diff":" v1.ConfigurationSpec{\n \tTemplate: v1.RevisionTemplateSpec{\n \t\tObjectMeta: {Labels: {\"component\": \"predictor\", \"serving.kubeflow.org/inferenceservice\": \"mnist-e2e-2\"}, Annotations: {\"autoscaling.knative.dev/class\": \"kpa.autoscaling.knative.dev\", \"autoscaling.knative.dev/minScale\": \"1\", \"internal.serving.kubeflow.org/storage-initializer-sourceuri\": \"pvc://end-to-end-pipeline-ztsm4-model-volume/\", \"sidecar.istio.io/inject\": \"false\"}},\n \t\tSpec: v1.RevisionSpec{\n \t\t\tPodSpec: v1.PodSpec{\n \t\t\t\t... // 28 identical fields\n \t\t\t\tReadinessGates: nil,\n \t\t\t\tRuntimeClassName: nil,\n- \t\t\t\tEnableServiceLinks: nil,\n+ \t\t\t\tEnableServiceLinks: &false,\n \t\t\t\tPreemptionPolicy: nil,\n \t\t\t\tOverhead: nil,\n \t\t\t\t... // 2 identical fields\n \t\t\t},\n \t\t\tContainerConcurrency: &0,\n \t\t\tTimeoutSeconds: &300,\n \t\t},\n \t},\n }\n"} {"level":"info","ts":1620772935.7975194,"logger":"KsvcReconciler","msg":"Updating knative service","namespace":"profile","name":"mnist-e2e-2-predictor-default"} {"level":"info","ts":1620772935.8235893,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8498397,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8500657,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.8545716,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8548284,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8549552,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.8554227,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8848317,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.8849595,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.8853812,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.913133,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772935.9132566,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772935.9136732,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.0020895,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.002502,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772936.0030398,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.022172,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.022295,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772936.0227535,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.0678174,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772936.0679862,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772936.068509,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772937.9828658,"logger":"CredentialBulder","msg":"found secret","SecretName":"default-token-n7cx5"} {"level":"info","ts":1620772938.0545385,"logger":"CredentialBulder","msg":"found secret","SecretName":"default-token-n7cx5"} {"level":"info","ts":1620772952.766172,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772952.766301,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772952.7665982,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772952.8829322,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772952.8831072,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772952.8835628,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.0701418,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.0703616,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.070938,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.1341956,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.134363,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.1351116,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.566679,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.5668004,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.5673482,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.5886502,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.5887856,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.5891395,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.6058204,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.6059582,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.6063447,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.6549683,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620772953.6551404,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620772953.6555724,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620773538.3125443,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620773538.3127034,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620773538.3131828,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620773538.3351989,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620773538.3353457,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620773538.3357365,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620790652.2456176,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620790652.2460716,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620790652.246698,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620790652.2469358,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620790652.247036,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620790652.2472653,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620794383.3470883,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620794383.347398,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620794383.3480515,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620826109.6370797,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620826109.637784,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620826109.6397216,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620833555.0125809,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620833555.0127358,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620833555.013176,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620861567.025887,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620861567.0260932,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620861567.0266256,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620872726.678136,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620872726.6783519,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620872726.7392294,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} E0513 05:04:28.791187 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1alpha3.VirtualService: the server is currently unable to handle the request (get virtualservices.networking.istio.io) E0513 05:04:28.791179 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1beta1.InferenceService: the server is currently unable to handle the request (get inferenceservices.serving.kubeflow.org) E0513 05:04:28.791281 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1.Service: the server is currently unable to handle the request (get services.serving.knative.dev) E0513 05:04:28.853626 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1alpha1.TrainedModel: the server is currently unable to handle the request (get trainedmodels.serving.kubeflow.org) {"level":"info","ts":1620882272.3579726,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882272.3582094,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620882272.3588848,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882276.8830557,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882276.883248,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620882276.883727,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} E0513 05:12:48.049634 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1.Service: the server is currently unable to handle the request (get services.serving.knative.dev) E0513 05:12:48.083670 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1alpha1.TrainedModel: the server is currently unable to handle the request (get trainedmodels.serving.kubeflow.org) E0513 05:12:48.123574 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1alpha3.VirtualService: the server is currently unable to handle the request (get virtualservices.networking.istio.io) E0513 05:12:48.124023 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch v1beta1.InferenceService: the server is currently unable to handle the request (get inferenceservices.serving.kubeflow.org) I0513 05:13:00.457722 1 trace.go:116] Trace[1790964291]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (started: 2021-05-13 05:12:49.373439004 +0000 UTC m=+127593.803534183) (total time: 11.084169617s): Trace[1790964291]: [11.084141032s] [11.084141032s] Objects listed I0513 05:13:05.124312 1 trace.go:116] Trace[1375632262]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (started: 2021-05-13 05:12:48.98320931 +0000 UTC m=+127593.413304470) (total time: 16.141065844s): Trace[1375632262]: [16.141039304s] [16.141039304s] Objects listed {"level":"info","ts":1620882785.1247594,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882785.1249487,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620882785.126026,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} I0513 05:13:05.333913 1 trace.go:116] Trace[616970762]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (started: 2021-05-13 05:12:49.217382364 +0000 UTC m=+127593.647477528) (total time: 16.11649424s): Trace[616970762]: [16.116485398s] [16.116485398s] Objects listed I0513 05:13:10.785735 1 trace.go:116] Trace[275276684]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (started: 2021-05-13 05:12:49.633848559 +0000 UTC m=+127594.063943779) (total time: 21.151859663s): Trace[275276684]: [21.151806247s] [21.151806247s] Objects listed {"level":"info","ts":1620882790.785898,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620882790.78602,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620882790.7864337,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620918242.515071,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620918242.5152462,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620918242.5157752,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620918242.5163105,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620918242.5164633,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620918242.5168517,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620921962.45178,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kubeflow.org/v1beta1","isvc":"mnist-e2e-2"} {"level":"info","ts":1620921962.4519224,"logger":"PredictorReconciler","msg":"Reconciling Predictor","PredictorSpec":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","runtimeVersion":"1.14.0","name":"kfserving-container","resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"1","memory":"2Gi"}}}}} {"level":"info","ts":1620921962.452611,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling ingress for inference service","isvc":"mnist-e2e-2"} {"level":"info","ts":1620944126.435673,"logger":"inferenceservice-v1beta1-mutating-webhook","msg":"Defaulting InferenceService","namespace":"profile","isvc":{"tensorflow":{"storageUri":"pvc://end-to-end-pipeline-ztsm4-model-volume/","name":"","resources":{}}}} I0513 22:15:27.486792 1 request.go:645] Throttling request took 1.045021006s, request: GET:https://10.99.240.1:443/apis/acme.cert-manager.io/v1alpha2?timeout=32s {"level":"info","ts":1620944128.7775807,"logger":"inferenceservice-v1beta1-validation-webhook","msg":"validate create","name":"mnist-e2e-2"}


3. I ran for failed predictor (revision 02)` kubectl logs mnist-e2e-2-predictor-default-00002-deployment-784bc4d56d-zznkn  --container=storage-initializer -n profile` and get:

Error from server (BadRequest): container "storage-initializer" in pod "mnist-e2e-2-predictor-default-00002-deployment-784bc4d56d-zznkn" is waiting to start: PodInitializing


4. I ran for the successful predictor (01) `kubectl logs mnist-e2e-2-predictor-default-00001-deployment-7875989b7d-hdh88  --container=kfserving-container -n profile` and get

2021-05-11 22:42:30.106496: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config: model_name: mnist-e2e-2 model_base_path: /mnt/models 2021-05-11 22:42:30.106835: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models. 2021-05-11 22:42:30.106864: I tensorflow_serving/model_servers/server_core.cc:561] (Re-)adding model: mnist-e2e-2 2021-05-11 22:42:30.209319: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: mnist-e2e-2 version: 1620772899} 2021-05-11 22:42:30.209400: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mnist-e2e-2 version: 1620772899} 2021-05-11 22:42:30.209422: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mnist-e2e-2 version: 1620772899} 2021-05-11 22:42:30.209455: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /mnt/models/1620772899 2021-05-11 22:42:30.209471: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /mnt/models/1620772899 2021-05-11 22:42:30.230562: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve } 2021-05-11 22:42:30.231653: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2021-05-11 22:42:30.249943: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle. 2021-05-11 22:42:30.459086: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: /mnt/models/1620772899 2021-05-11 22:42:30.461961: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 252482 microseconds. 2021-05-11 22:42:30.462051: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:103] No warmup data file found at /mnt/models/1620772899/assets.extra/tf_serving_warmup_requests 2021-05-11 22:42:30.462224: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: mnist-e2e-2 version: 1620772899} 2021-05-11 22:42:30.465202: I tensorflow_serving/model_servers/server.cc:324] Running gRPC ModelServer at 0.0.0.0:9000 ... [warn] getaddrinfo: address family for nodename not supported 2021-05-11 22:42:30.467374: I tensorflow_serving/model_servers/server.cc:344] Exporting HTTP/REST API at:localhost:8080 ... [evhttp_server.cc : 239] RAW: Entering the event loop ...

andreyvelich commented 3 years ago

Thank you for providing this information @zijianjoy. What version are you using for KFServing manager? Is it v0.5.1 ?

It seems that storage-initializer container can't be executed and in the PodInitializing state. In addition, can you provide output of this commands:

  1. kubectl get revision mnist-e2e-2-predictor-default-00002 -n profile
  2. kubectl describe revision mnist-e2e-2-predictor-default-00002 -n profile
  3. kubectl describe pod mnist-e2e-2-predictor-default-00002-deployment-784bc4d56d-zznkn -n profile.

Also, I tried to deploy Kubeflow manifest on GCP with Dex setup: https://github.com/kubeflow/manifests/blob/master/example/kustomization.yaml. For me this example is working.

@yuzisun Any ideas why storage-initializer is in the Initializing state ?

andreyvelich commented 3 years ago

Also cc @theofpa for the KFServing question.

zijianjoy commented 3 years ago

Thank you @andreyvelich for helping to debug.

I confirm that KFServing manager version is v0.5.1.

Providing following logs:

  1. kubectl get revision mnist-e2e-predictor-default-00002 -n profile-ref10
NAME                                CONFIG NAME                   K8S SERVICE NAME                    GENERATION   READY   REASON
mnist-e2e-predictor-default-00002   mnist-e2e-predictor-default   mnist-e2e-predictor-default-00002   2            False   PodInitializing
  1. kubectl describe revision mnist-e2e-predictor-default-00002 -n profile-ref10

    Name:         mnist-e2e-predictor-default-00002
    Namespace:    profile-ref10
    Labels:       component=predictor
              serving.knative.dev/configuration=mnist-e2e-predictor-default
              serving.knative.dev/configurationGeneration=2
              serving.knative.dev/configurationUID=5b98ca8d-e1eb-45b0-9555-3cabd9644e91
              serving.knative.dev/routingState=active
              serving.knative.dev/service=mnist-e2e-predictor-default
              serving.knative.dev/serviceUID=7dc1aba0-6e58-4555-b015-acde92f53292
              serving.kubeflow.org/inferenceservice=mnist-e2e
    Annotations:  autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
              autoscaling.knative.dev/minScale: 1
              internal.serving.kubeflow.org/storage-initializer-sourceuri: pvc://end-to-end-pipeline-6v2q4-model-volume/
              serving.knative.dev/creator: system:serviceaccount:kubeflow:default
              serving.knative.dev/routes: mnist-e2e-predictor-default
              serving.knative.dev/routingStateModified: 2021-05-14T20:55:41Z
              sidecar.istio.io/inject: false
    API Version:  serving.knative.dev/v1
    Kind:         Revision
    Metadata:
    Creation Timestamp:  2021-05-14T20:55:41Z
    Generation:          1
    Managed Fields:
    API Version:  serving.knative.dev/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:autoscaling.knative.dev/class:
          f:autoscaling.knative.dev/minScale:
          f:internal.serving.kubeflow.org/storage-initializer-sourceuri:
          f:serving.knative.dev/creator:
          f:serving.knative.dev/routes:
          f:serving.knative.dev/routingStateModified:
          f:sidecar.istio.io/inject:
        f:labels:
          .:
          f:component:
          f:serving.knative.dev/configuration:
          f:serving.knative.dev/configurationGeneration:
          f:serving.knative.dev/configurationUID:
          f:serving.knative.dev/routingState:
          f:serving.knative.dev/service:
          f:serving.knative.dev/serviceUID:
          f:serving.kubeflow.org/inferenceservice:
        f:ownerReferences:
          .:
          k:{"uid":"5b98ca8d-e1eb-45b0-9555-3cabd9644e91"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:containerConcurrency:
        f:containers:
        f:timeoutSeconds:
      f:status:
        .:
        f:conditions:
        f:containerStatuses:
        f:imageDigest:
        f:observedGeneration:
        f:serviceName:
    Manager:    controller
    Operation:  Update
    Time:       2021-05-14T20:55:42Z
    Owner References:
    API Version:           serving.knative.dev/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Configuration
    Name:                  mnist-e2e-predictor-default
    UID:                   5b98ca8d-e1eb-45b0-9555-3cabd9644e91
    Resource Version:        166702
    Self Link:               /apis/serving.knative.dev/v1/namespaces/profile-ref10/revisions/mnist-e2e-predictor-default-00002
    UID:                     31815df6-59d4-454e-9654-6e9807b598fc
    Spec:
    Container Concurrency:  0
    Containers:
    Args:
      --port=9000
      --rest_api_port=8080
      --model_name=mnist-e2e
      --model_base_path=/mnt/models
      --rest_api_timeout_in_ms=60000
    Command:
      /usr/bin/tensorflow_model_server
    Image:  tensorflow/serving:1.14.0
    Name:   kfserving-container
    Readiness Probe:
      Success Threshold:  1
      Tcp Socket:
        Port:  0
    Resources:
      Limits:
        Cpu:     1
        Memory:  2Gi
      Requests:
        Cpu:             1
        Memory:          2Gi
    Enable Service Links:  false
    Timeout Seconds:       300
    Status:
    Conditions:
    Last Transition Time:  2021-05-14T20:55:43Z
    Message:               Requests to the target are being buffered as resources are provisioned.
    Reason:                Queued
    Severity:              Info
    Status:                Unknown
    Type:                  Active
    Last Transition Time:  2021-05-14T20:55:41Z
    Status:                Unknown
    Type:                  ContainerHealthy
    Last Transition Time:  2021-05-14T21:05:44Z
    Reason:                PodInitializing
    Status:                False
    Type:                  Ready
    Last Transition Time:  2021-05-14T21:05:44Z
    Reason:                PodInitializing
    Status:                False
    Type:                  ResourcesAvailable
    Container Statuses:
    Image Digest:       index.docker.io/tensorflow/serving@sha256:f7e59a29cbc17a6b507751cddde37bccad4407c05ebf2c13b8e6ccb7d2e9affb
    Name:               kfserving-container
    Image Digest:         index.docker.io/tensorflow/serving@sha256:f7e59a29cbc17a6b507751cddde37bccad4407c05ebf2c13b8e6ccb7d2e9affb
    Observed Generation:  1
    Service Name:         mnist-e2e-predictor-default-00002
    Events:                 <none>
  2. kubectl describe pod mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5 -n profile-ref10

Name:           mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5
Namespace:      profile-ref10
Priority:       0
Node:           gke-kf-ref10-default-pool-50caf419-s900/10.138.15.208
Start Time:     Fri, 14 May 2021 20:55:44 +0000
Labels:         app=mnist-e2e-predictor-default-00002
                component=predictor
                pod-template-hash=86b54db9f9
                service.istio.io/canonical-name=mnist-e2e-predictor-default
                service.istio.io/canonical-revision=mnist-e2e-predictor-default-00002
                serving.knative.dev/configuration=mnist-e2e-predictor-default
                serving.knative.dev/configurationGeneration=2
                serving.knative.dev/configurationUID=5b98ca8d-e1eb-45b0-9555-3cabd9644e91
                serving.knative.dev/revision=mnist-e2e-predictor-default-00002
                serving.knative.dev/revisionUID=31815df6-59d4-454e-9654-6e9807b598fc
                serving.knative.dev/service=mnist-e2e-predictor-default
                serving.knative.dev/serviceUID=7dc1aba0-6e58-4555-b015-acde92f53292
                serving.kubeflow.org/inferenceservice=mnist-e2e
Annotations:    autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
                autoscaling.knative.dev/minScale: 1
                internal.serving.kubeflow.org/storage-initializer-sourceuri: pvc://end-to-end-pipeline-6v2q4-model-volume/
                serving.knative.dev/creator: system:serviceaccount:kubeflow:default
                sidecar.istio.io/inject: false
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/mnist-e2e-predictor-default-00002-deployment-86b54db9f9
Init Containers:
  storage-initializer:
    Container ID:  
    Image:         gcr.io/kfserving/storage-initializer:v0.5.1
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      /mnt/pvc/
      /mnt/models
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /mnt/models from kfserving-provision-location (rw)
      /mnt/pvc from kfserving-pvc-source (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nbcbj (ro)
Containers:
  kfserving-container:
    Container ID:  
    Image:         index.docker.io/tensorflow/serving@sha256:f7e59a29cbc17a6b507751cddde37bccad4407c05ebf2c13b8e6ccb7d2e9affb
    Image ID:      
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/tensorflow_model_server
    Args:
      --port=9000
      --rest_api_port=8080
      --model_name=mnist-e2e
      --model_base_path=/mnt/models
      --rest_api_timeout_in_ms=60000
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      PORT:             8080
      K_REVISION:       mnist-e2e-predictor-default-00002
      K_CONFIGURATION:  mnist-e2e-predictor-default
      K_SERVICE:        mnist-e2e-predictor-default
    Mounts:
      /mnt/models from kfserving-provision-location (ro)
      /mnt/pvc from kfserving-pvc-source (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nbcbj (ro)
  queue-proxy:
    Container ID:   
    Image:          gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:6cd0c234bfbf88ac75df5243c2f9213dcc9def610414c506d418f9388187b771
    Image ID:       
    Ports:          8022/TCP, 9090/TCP, 9091/TCP, 8012/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      25m
    Readiness:  http-get http://:8012/ delay=0s timeout=1s period=1s #success=1 #failure=3
    Startup:    exec [/ko-app/queue -probe-timeout 10m0s] delay=0s timeout=600s period=1s #success=1 #failure=1
    Environment:
      SERVING_NAMESPACE:                      profile-ref10
      SERVING_SERVICE:                        mnist-e2e-predictor-default
      SERVING_CONFIGURATION:                  mnist-e2e-predictor-default
      SERVING_REVISION:                       mnist-e2e-predictor-default-00002
      QUEUE_SERVING_PORT:                     8012
      CONTAINER_CONCURRENCY:                  0
      REVISION_TIMEOUT_SECONDS:               300
      SERVING_POD:                            mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5 (v1:metadata.name)
      SERVING_POD_IP:                          (v1:status.podIP)
      SERVING_LOGGING_CONFIG:                 
      SERVING_LOGGING_LEVEL:                  
      SERVING_REQUEST_LOG_TEMPLATE:           {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}
      SERVING_ENABLE_REQUEST_LOG:             false
      SERVING_REQUEST_METRICS_BACKEND:        prometheus
      TRACING_CONFIG_BACKEND:                 none
      TRACING_CONFIG_ZIPKIN_ENDPOINT:         
      TRACING_CONFIG_STACKDRIVER_PROJECT_ID:  
      TRACING_CONFIG_DEBUG:                   false
      TRACING_CONFIG_SAMPLE_RATE:             0.1
      USER_PORT:                              8080
      SYSTEM_NAMESPACE:                       knative-serving
      METRICS_DOMAIN:                         knative.dev/internal/serving
      SERVING_READINESS_PROBE:                {"tcpSocket":{"port":8080,"host":"127.0.0.1"},"successThreshold":1}
      ENABLE_PROFILING:                       false
      SERVING_ENABLE_PROBE_REQUEST_LOG:       false
      METRICS_COLLECTOR_ADDRESS:              
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nbcbj (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-nbcbj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nbcbj
    Optional:    false
  kfserving-pvc-source:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  end-to-end-pipeline-6v2q4-model-volume
    ReadOnly:   false
  kfserving-provision-location:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                      From     Message
  ----     ------       ----                     ----     -------
  Warning  FailedMount  6m41s (x1226 over 3d6h)  kubelet  Unable to attach or mount volumes: unmounted volumes=[kfserving-pvc-source], unattached volumes=[kfserving-pvc-source kfserving-provision-location default-token-nbcbj]: timed out waiting for the condition
  Warning  FailedMount  2m10s (x455 over 3d6h)   kubelet  Unable to attach or mount volumes: unmounted volumes=[kfserving-pvc-source], unattached volumes=[default-token-nbcbj kfserving-pvc-source kfserving-provision-location]: timed out waiting for the condition
zijianjoy commented 3 years ago

Additional information:

  1. Get persistent volume of the cluster: kubectl get pv -A

NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE pvc-3cd47ad0-76eb-44ca-9b01-30d1dbfaa0fa 10Gi RWO Delete Bound profile-ref10/workspace-notebook standard 3d6h pvc-77985fef-a999-45ec-a2fb-2808a4df0675 1Gi RWO Delete Bound profile-ref10/end-to-end-pipeline-6v2q4-model-volume standard 3d6h pvc-c582da8e-1d39-45a3-9913-9726313fdf77 10Gi RWO Delete Bound kubeflow/katib-mysql standard 3d8h

2. Get Persistent Volume Claim of the cluster: `kubectl get pvc -A`

NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE kubeflow katib-mysql Bound pvc-c582da8e-1d39-45a3-9913-9726313fdf77 10Gi RWO standard 3d8h profile-ref10 end-to-end-pipeline-6v2q4-model-volume Bound pvc-77985fef-a999-45ec-a2fb-2808a4df0675 1Gi RWO standard 3d6h profile-ref10 workspace-notebook Bound pvc-3cd47ad0-76eb-44ca-9b01-30d1dbfaa0fa 10Gi RWO standard 3d6h

andreyvelich commented 3 years ago

@zijianjoy Sounds like model volume can't be attached to the KFServing container. Maybe TFJob blocks it ? From my side on GCP the volume was bounded properly.

Try to remove delete_finished_tfjob=False parameter in the TFJob op:

    op = tfjob_launcher_op(
        name=tfjob_name,
        namespace=tfjob_namespace,
        chief_spec=json.dumps(tfjob_chief_spec),
        worker_spec=json.dumps(tfjob_worker_spec),
        tfjob_timeout_minutes=60)
andreyvelich commented 3 years ago

In the meantime, please try to describe PVC to check which resources are using this volume: kubectl describe pvc end-to-end-pipeline-6v2q4-model-volume -n profile-ref10

zijianjoy commented 3 years ago

Try to remove delete_finished_tfjob=False parameter in the TFJob op:

If I remove the delete_finished_tfjob parameter, it will fail at Kubeflow - Launch TFJob step because it is Exception: Timeout waiting for kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 to enter one of the conditions ['Succeeded', 'Failed'].

Logs are

INFO:root:Generating tfjob template.
INFO:root:Creating kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10.
INFO:root:Created kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
...
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
...
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
Traceback (most recent call last):
  File "/ml/launch_tfjob.py", line 136, in <module>
    main()
  File "/ml/launch_tfjob.py", line 131, in main
    timeout=datetime.timedelta(minutes=args.tfjobTimeoutMinutes))
  File "/ml/launch_crd.py", line 75, in wait_for_condition
    "conditions {4}.".format(self.group, self.plural, name, namespace, expected_conditions))
Exception: Timeout waiting for kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 to enter one of the conditions ['Succeeded', 'Failed'].
zijianjoy commented 3 years ago

In the meantime, please try to describe PVC to check which resources are using this volume: kubectl describe pvc end-to-end-pipeline-6v2q4-model-volume -n profile-ref10

Sharing the result of this pvc:


Name:          end-to-end-pipeline-6v2q4-model-volume
Namespace:     profile-ref10
StorageClass:  standard
Status:        Bound
Volume:        pvc-77985fef-a999-45ec-a2fb-2808a4df0675
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       mnist-e2e-2-chief-0
               mnist-e2e-chief-0
               mnist-e2e-predictor-default-00001-deployment-6fc58cc5b4-7d64k
               mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5
Events:        <none>
zijianjoy commented 3 years ago

Sharing the compiled pipeline definition of this sample:

https://gist.github.com/zijianjoy/e96c8cc6533e21e48cf193746831b089

andreyvelich commented 3 years ago

Try to remove delete_finished_tfjob=False parameter in the TFJob op:

If I remove the delete_finished_tfjob parameter, it will fail at Kubeflow - Launch TFJob step because it is Exception: Timeout waiting for kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 to enter one of the conditions ['Succeeded', 'Failed'].

Logs are

INFO:root:Generating tfjob template.
INFO:root:Creating kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10.
INFO:root:Created kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
...
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
...
INFO:root:Current condition of kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 is Created.
Traceback (most recent call last):
  File "/ml/launch_tfjob.py", line 136, in <module>
    main()
  File "/ml/launch_tfjob.py", line 131, in main
    timeout=datetime.timedelta(minutes=args.tfjobTimeoutMinutes))
  File "/ml/launch_crd.py", line 75, in wait_for_condition
    "conditions {4}.".format(self.group, self.plural, name, namespace, expected_conditions))
Exception: Timeout waiting for kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 to enter one of the conditions ['Succeeded', 'Failed'].

Can you double check why TFJob is in the Created status ? Describe the mnist-e2e-2 TFJob, check the pod logs.

zijianjoy commented 3 years ago

List of pods in kubeflow and user profile namespaces:

NAMESPACE         NAME                                                             READY   STATUS              RESTARTS   AGE
kubeflow          admission-webhook-deployment-6748884cff-r6d76                    1/1     Running             0          5d1h
kubeflow          cache-deployer-deployment-7d787d9dd9-6bcj6                       2/2     Running             1          5d1h
kubeflow          cache-server-5948c9cd48-mp5s5                                    2/2     Running             1          5d1h
kubeflow          centraldashboard-cc9574b95-dg6w6                                 1/1     Running             0          5d1h
kubeflow          cloud-endpoints-controller-67c7b4b7f7-59nqk                      1/1     Running             0          5d1h
kubeflow          cloudsqlproxy-d7f4c7b86-bp84k                                    2/2     Running             1          5d1h
kubeflow          jupyter-web-app-deployment-6f75d7796-fzqpg                       1/1     Running             0          5d1h
kubeflow          katib-controller-6d4d54cff7-gvvw7                                1/1     Running             0          5d1h
kubeflow          katib-db-manager-697d8ddd69-l7z7l                                1/1     Running             0          5d1h
kubeflow          katib-mysql-6fd88c9c89-nhggp                                     1/1     Running             0          5d1h
kubeflow          katib-ui-74d89ffbfb-sd8zr                                        1/1     Running             0          5d1h
kubeflow          kfserving-controller-manager-0                                   2/2     Running             0          5d1h
kubeflow          kubeflow-pipelines-profile-controller-d87fff898-84ljp            1/1     Running             0          5d1h
kubeflow          metacontroller-0                                                 1/1     Running             0          5d1h
kubeflow          metadata-envoy-deployment-7db956c6f5-dlcmd                       1/1     Running             0          5d1h
kubeflow          metadata-grpc-deployment-6d85777b7c-lxc8f                        2/2     Running             2          5d1h
kubeflow          metadata-writer-789b57984d-lrtc8                                 2/2     Running             7          5d1h
kubeflow          minio-644b8b6bc5-qlf9p                                           2/2     Running             0          5d1h
kubeflow          ml-pipeline-74769d8bf5-pw54l                                     2/2     Running             0          5d1h
kubeflow          ml-pipeline-persistenceagent-db88647d5-h2vss                     2/2     Running             0          2d2h
kubeflow          ml-pipeline-scheduledworkflow-898c4b88f-4pnd5                    2/2     Running             0          5d1h
kubeflow          ml-pipeline-ui-8597f9fb9-2tbnm                                   2/2     Running             0          5d1h
kubeflow          ml-pipeline-viewer-crd-67dbdc8d86-ffckf                          2/2     Running             1          5d1h
kubeflow          ml-pipeline-visualizationserver-57566d4c45-ktzvg                 2/2     Running             0          5d1h
kubeflow          notebook-controller-deployment-65d666b7b-nhs4v                   1/1     Running             0          5d1h
kubeflow          profiles-deployment-588c449459-tlhd2                             2/2     Running             0          5d1h
kubeflow          pytorch-operator-b89ff96cc-ztrrp                                 2/2     Running             2          5d1h
kubeflow          tensorboard-controller-controller-manager-7b955d8f69-b8t4q       3/3     Running             14         5d1h
kubeflow          tensorboards-web-app-deployment-64b569989d-jgc2n                 1/1     Running             0          5d1h
kubeflow          tf-job-operator-6fd4d8d869-72dng                                 1/1     Running             12         5d1h
kubeflow          volumes-web-app-deployment-7df5bd657c-gcjdt                      1/1     Running             0          5d1h
kubeflow          workflow-controller-6855d48fd7-x5nf6                             2/2     Running             2          5d1h
profile-ref10     end-to-end-pipeline-79ngw-1429492901                             0/1     Completed           0          3h46m
profile-ref10     end-to-end-pipeline-79ngw-2227434625                             0/2     Error               0          3h46m
profile-ref10     end-to-end-pipeline-79ngw-2393645640                             0/1     Completed           0          3h46m
profile-ref10     end-to-end-pipeline-79ngw-3092938776                             0/1     Completed           0          3h46m
profile-ref10     end-to-end-pipeline-9s5fb-1797256841                             0/1     Completed           0          138m
profile-ref10     end-to-end-pipeline-9s5fb-2312731595                             0/2     Completed           0          135m
profile-ref10     end-to-end-pipeline-9s5fb-2742465402                             0/2     Error               0          135m
profile-ref10     end-to-end-pipeline-9s5fb-3438513990                             0/2     Completed           0          138m
profile-ref10     end-to-end-pipeline-lk2q6-1163786406                             0/2     Completed           0          3h40m
profile-ref10     end-to-end-pipeline-lk2q6-1403031006                             0/1     Completed           0          3h43m
profile-ref10     end-to-end-pipeline-lk2q6-2126139547                             0/2     Completed           0          3h43m
profile-ref10     end-to-end-pipeline-lk2q6-362872075                              0/2     Error               0          3h39m
profile-ref10     gpu-smoke-check-l5ctv-833024161                                  0/2     Pending             0          5d1h
profile-ref10     ml-pipeline-ui-artifact-dd9b8f4bf-t6hfp                          2/2     Running             0          5d1h
profile-ref10     ml-pipeline-visualizationserver-85b5c575db-w8r97                 2/2     Running             0          5d1h
profile-ref10     mnist-e2e-2-chief-0                                              0/1     ContainerCreating   0          3h39m
profile-ref10     mnist-e2e-2-random-6b7d58d778-pl9rk                              1/1     Running             0          3h43m
profile-ref10     mnist-e2e-2-worker-0                                             1/1     Running             0          3h39m
profile-ref10     mnist-e2e-chief-0                                                0/1     Completed           0          4d23h
profile-ref10     mnist-e2e-predictor-default-00001-deployment-6fc58cc5b4-7d64k    2/2     Running             0          4d23h
profile-ref10     mnist-e2e-predictor-default-00002-deployment-86b54db9f9-2qdq5    0/2     Init:0/1            0          4d23h
profile-ref10     mnist-e2e-random-69f8755785-qcc7x                                1/1     Running             0          4d23h
profile-ref10     mnist-e2e-ui-chief-0                                             0/1     ContainerCreating   0          135m
profile-ref10     mnist-e2e-ui-random-684c78955d-2885t                             1/1     Running             0          138m
profile-ref10     mnist-e2e-ui-worker-0                                            1/1     Running             0          135m
profile-ref10     notebook-0                                                       2/2     Running             0          4d23h

Describe TFJob pod: kubectl describe pod end-to-end-pipeline-lk2q6-362872075 -n profile-ref10

Name:         end-to-end-pipeline-lk2q6-362872075
Namespace:    profile-ref10
Priority:     0
Node:         gke-kf-ref10-default-pool-50caf419-db2h/10.138.15.207
Start Time:   Wed, 19 May 2021 16:46:17 +0000
Labels:       pipelines.kubeflow.org/cache_enabled=true
              pipelines.kubeflow.org/cache_id=
              pipelines.kubeflow.org/metadata_context_id=71
              pipelines.kubeflow.org/metadata_execution_id=89
              pipelines.kubeflow.org/metadata_written=true
              workflows.argoproj.io/completed=true
              workflows.argoproj.io/workflow=end-to-end-pipeline-lk2q6
Annotations:  pipelines.kubeflow.org/arguments.parameters:
                {"ActiveDeadlineSeconds": "-1", "BackoffLimit": "-1", "Chief Spec": "{\"replicas\": 1, \"restartPolicy\": \"OnFailure\", \"template\": {\"...
              pipelines.kubeflow.org/component_ref:
                {"digest": "98fd1829c0a5ff1b36c53bc2a8d6482d33a7a5dc042e2ee661840be06065c325", "url": "https://raw.githubusercontent.com/kubeflow/pipeline...
              pipelines.kubeflow.org/component_spec:
                {"description": "Kubeflow TFJob launcher", "implementation": {"container": {"args": ["--name", {"inputValue": "Name"}, "--namespace", {"in...
              pipelines.kubeflow.org/execution_cache_key: a2f1350d52463289d452d514614c5c768bba05ae87b40ac0cee77e07d28b663c
              pipelines.kubeflow.org/metadata_input_artifact_ids: []
              pipelines.kubeflow.org/metadata_output_artifact_ids:
                [{"id": 160, "name": "main-logs", "uri": "minio://jamxl-kfp-dev-kfp/artifacts/end-to-end-pipeline-lk2q6/end-to-end-pipeline-lk2q6-36287207...
              sidecar.istio.io/inject: false
              workflows.argoproj.io/node-name: end-to-end-pipeline-lk2q6.kubeflow-launch-tfjob
              workflows.argoproj.io/outputs:
                {"artifacts":[{"name":"main-logs","archiveLogs":true,"s3":{"endpoint":"minio-service.kubeflow:9000","bucket":"jamxl-kfp-dev-kfp","insecure...
              workflows.argoproj.io/template:
                {"name":"kubeflow-launch-tfjob","arguments":{},"inputs":{"parameters":[{"name":"convert-katib-results-Output","value":"--tf-learning-rate=...
Status:       Failed
IP:           10.108.0.127
IPs:
  IP:           10.108.0.127
Controlled By:  Workflow/end-to-end-pipeline-lk2q6
Containers:
  wait:
    Container ID:  docker://7e06f6e7269576c0ea141c69301189a44919863e7a595b2b3c8f096fd4167d0d
    Image:         gcr.io/ml-pipeline/argoexec:v2.12.9-license-compliance
    Image ID:      docker-pullable://gcr.io/ml-pipeline/argoexec@sha256:9b512a260a215d65054685511dbd867184b0041721c51126c8ef478b4f32bd5e
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      wait
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 19 May 2021 16:46:18 +0000
      Finished:     Wed, 19 May 2021 17:45:54 +0000
    Ready:          False
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:  end-to-end-pipeline-lk2q6-362872075 (v1:metadata.name)
    Mounts:
      /argo/podmetadata from podmetadata (rw)
      /argo/secret/mlpipeline-minio-artifact from mlpipeline-minio-artifact (ro)
      /var/run/docker.sock from docker-sock (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-editor-token-9bv2r (ro)
  main:
    Container ID:  docker://ff7e0163d8e7936a5140fb765480e7a33b99e3e4491102c9579daf6b8e685290
    Image:         nikenano/launchernew:latest
    Image ID:      docker-pullable://nikenano/launchernew@sha256:79c9599a50959541854856b4069529a7bc0fa312eebcec2ee0ab31d756b5eb9c
    Port:          <none>
    Host Port:     <none>
    Command:
      python
      /ml/launch_tfjob.py
    Args:
      --name
      mnist-e2e-2
      --namespace
      profile-ref10
      --version
      v1
      --activeDeadlineSeconds
      -1
      --backoffLimit
      -1
      --cleanPodPolicy
      Running
      --ttlSecondsAfterFinished
      -1
      --psSpec
      {}
      --workerSpec
      {"replicas": 1, "restartPolicy": "OnFailure", "template": {"metadata": {"annotations": {"sidecar.istio.io/inject": "false"}}, "spec": {"containers": [{"name": "tensorflow", "image": "docker.io/liuhougangxa/tf-estimator-mnist", "command": ["sh", "-c"], "args": ["python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88"]}]}}}
      --chiefSpec
      {"replicas": 1, "restartPolicy": "OnFailure", "template": {"metadata": {"annotations": {"sidecar.istio.io/inject": "false"}}, "spec": {"containers": [{"name": "tensorflow", "image": "docker.io/liuhougangxa/tf-estimator-mnist", "command": ["sh", "-c"], "args": ["python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88"], "volumeMounts": [{"mountPath": "/mnt/export", "name": "model-volume"}]}], "volumes": [{"name": "model-volume", "persistentVolumeClaim": {"claimName": "end-to-end-pipeline-6v2q4-model-volume"}}]}}}
      --evaluatorSpec
      {}
      --tfjobTimeoutMinutes
      60
      --deleteAfterDone
      True
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 19 May 2021 16:46:19 +0000
      Finished:     Wed, 19 May 2021 17:45:53 +0000
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-editor-token-9bv2r (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  podmetadata:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  docker-sock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  Socket
  mlpipeline-minio-artifact:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mlpipeline-minio-artifact
    Optional:    false
  default-editor-token-9bv2r:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-editor-token-9bv2r
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

pod logs:

wait container: kubectl logs end-to-end-pipeline-lk2q6-362872075 -n profile-ref10 --container=wait


time="2021-05-19T16:46:18.387Z" level=info msg="Starting Workflow Executor" version=v2.12.9
time="2021-05-19T16:46:18.437Z" level=info msg="Creating a docker executor"
time="2021-05-19T16:46:18.438Z" level=info msg="Executor (version: v2.12.9, build_date: 2021-02-16T22:51:48Z) initialized (pod: profile-ref10/end-to-end-pipeline-lk2q6-362872075) with template:\n{\"name\":\"kubeflow-launch-tfjob\",\"arguments\":{},\"inputs\":{\"parameters\":[{\"name\":\"convert-katib-results-Output\",\"value\":\"--tf-learning-rate=0.04953975758073499 --tf-batch-size=88\"},{\"name\":\"model-volume-name\",\"value\":\"end-to-end-pipeline-6v2q4-model-volume\"},{\"name\":\"name\",\"value\":\"mnist-e2e-2\"},{\"name\":\"namespace\",\"value\":\"profile-ref10\"},{\"name\":\"training_steps\",\"value\":\"200\"}]},\"outputs\":{},\"metadata\":{\"annotations\":{\"pipelines.kubeflow.org/arguments.parameters\":\"{\\\"ActiveDeadlineSeconds\\\": \\\"-1\\\", \\\"BackoffLimit\\\": \\\"-1\\\", \\\"Chief Spec\\\": \\\"{\\\\\\\"replicas\\\\\\\": 1, \\\\\\\"restartPolicy\\\\\\\": \\\\\\\"OnFailure\\\\\\\", \\\\\\\"template\\\\\\\": {\\\\\\\"metadata\\\\\\\": {\\\\\\\"annotations\\\\\\\": {\\\\\\\"sidecar.istio.io/inject\\\\\\\": \\\\\\\"false\\\\\\\"}}, \\\\\\\"spec\\\\\\\": {\\\\\\\"containers\\\\\\\": [{\\\\\\\"name\\\\\\\": \\\\\\\"tensorflow\\\\\\\", \\\\\\\"image\\\\\\\": \\\\\\\"docker.io/liuhougangxa/tf-estimator-mnist\\\\\\\", \\\\\\\"command\\\\\\\": [\\\\\\\"sh\\\\\\\", \\\\\\\"-c\\\\\\\"], \\\\\\\"args\\\\\\\": [\\\\\\\"python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88\\\\\\\"], \\\\\\\"volumeMounts\\\\\\\": [{\\\\\\\"mountPath\\\\\\\": \\\\\\\"/mnt/export\\\\\\\", \\\\\\\"name\\\\\\\": \\\\\\\"model-volume\\\\\\\"}]}], \\\\\\\"volumes\\\\\\\": [{\\\\\\\"name\\\\\\\": \\\\\\\"model-volume\\\\\\\", \\\\\\\"persistentVolumeClaim\\\\\\\": {\\\\\\\"claimName\\\\\\\": \\\\\\\"end-to-end-pipeline-6v2q4-model-volume\\\\\\\"}}]}}}\\\", \\\"CleanPodPolicy\\\": \\\"Running\\\", \\\"Delete Finished Tfjob\\\": \\\"True\\\", \\\"Evaluator Spec\\\": \\\"{}\\\", \\\"Name\\\": \\\"mnist-e2e-2\\\", \\\"Namespace\\\": \\\"profile-ref10\\\", \\\"PS Spec\\\": \\\"{}\\\", \\\"Tfjob Timeout Minutes\\\": \\\"60\\\", \\\"Version\\\": \\\"v1\\\", \\\"Worker Spec\\\": \\\"{\\\\\\\"replicas\\\\\\\": 1, \\\\\\\"restartPolicy\\\\\\\": \\\\\\\"OnFailure\\\\\\\", \\\\\\\"template\\\\\\\": {\\\\\\\"metadata\\\\\\\": {\\\\\\\"annotations\\\\\\\": {\\\\\\\"sidecar.istio.io/inject\\\\\\\": \\\\\\\"false\\\\\\\"}}, \\\\\\\"spec\\\\\\\": {\\\\\\\"containers\\\\\\\": [{\\\\\\\"name\\\\\\\": \\\\\\\"tensorflow\\\\\\\", \\\\\\\"image\\\\\\\": \\\\\\\"docker.io/liuhougangxa/tf-estimator-mnist\\\\\\\", \\\\\\\"command\\\\\\\": [\\\\\\\"sh\\\\\\\", \\\\\\\"-c\\\\\\\"], \\\\\\\"args\\\\\\\": [\\\\\\\"python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88\\\\\\\"]}]}}}\\\", \\\"ttl Seconds After Finished\\\": \\\"-1\\\"}\",\"pipelines.kubeflow.org/component_ref\":\"{\\\"digest\\\": \\\"98fd1829c0a5ff1b36c53bc2a8d6482d33a7a5dc042e2ee661840be06065c325\\\", \\\"url\\\": \\\"https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/launcher/component.yaml\\\"}\",\"pipelines.kubeflow.org/component_spec\":\"{\\\"description\\\": \\\"Kubeflow TFJob launcher\\\", \\\"implementation\\\": {\\\"container\\\": {\\\"args\\\": [\\\"--name\\\", {\\\"inputValue\\\": \\\"Name\\\"}, \\\"--namespace\\\", {\\\"inputValue\\\": \\\"Namespace\\\"}, \\\"--version\\\", {\\\"inputValue\\\": \\\"Version\\\"}, \\\"--activeDeadlineSeconds\\\", {\\\"inputValue\\\": \\\"ActiveDeadlineSeconds\\\"}, \\\"--backoffLimit\\\", {\\\"inputValue\\\": \\\"BackoffLimit\\\"}, \\\"--cleanPodPolicy\\\", {\\\"inputValue\\\": \\\"CleanPodPolicy\\\"}, \\\"--ttlSecondsAfterFinished\\\", {\\\"inputValue\\\": \\\"ttl Seconds After Finished\\\"}, \\\"--psSpec\\\", {\\\"inputValue\\\": \\\"PS Spec\\\"}, \\\"--workerSpec\\\", {\\\"inputValue\\\": \\\"Worker Spec\\\"}, \\\"--chiefSpec\\\", {\\\"inputValue\\\": \\\"Chief Spec\\\"}, \\\"--evaluatorSpec\\\", {\\\"inputValue\\\": \\\"Evaluator Spec\\\"}, \\\"--tfjobTimeoutMinutes\\\", {\\\"inputValue\\\": \\\"Tfjob Timeout Minutes\\\"}, \\\"--deleteAfterDone\\\", {\\\"inputValue\\\": \\\"Delete Finished Tfjob\\\"}], \\\"command\\\": [\\\"python\\\", \\\"/ml/launch_tfjob.py\\\"], \\\"image\\\": \\\"nikenano/launchernew:latest\\\"}}, \\\"inputs\\\": [{\\\"description\\\": \\\"TFJob name.\\\", \\\"name\\\": \\\"Name\\\", \\\"type\\\": \\\"String\\\"}, {\\\"default\\\": \\\"kubeflow\\\", \\\"description\\\": \\\"TFJob namespace.\\\", \\\"name\\\": \\\"Namespace\\\", \\\"type\\\": \\\"String\\\"}, {\\\"default\\\": \\\"v1\\\", \\\"description\\\": \\\"TFJob version.\\\", \\\"name\\\": \\\"Version\\\", \\\"type\\\": \\\"String\\\"}, {\\\"default\\\": -1, \\\"description\\\": \\\"Specifies the duration (in seconds) since startTime during which the job can remain active before it is terminated. Must be a positive integer. This setting applies only to pods where restartPolicy is OnFailure or Always.\\\", \\\"name\\\": \\\"ActiveDeadlineSeconds\\\", \\\"type\\\": \\\"Integer\\\"}, {\\\"default\\\": -1, \\\"description\\\": \\\"Number of retries before marking this job as failed.\\\", \\\"name\\\": \\\"BackoffLimit\\\", \\\"type\\\": \\\"Integer\\\"}, {\\\"default\\\": -1, \\\"description\\\": \\\"Defines the TTL for cleaning up finished TFJobs.\\\", \\\"name\\\": \\\"ttl Seconds After Finished\\\", \\\"type\\\": \\\"Integer\\\"}, {\\\"default\\\": \\\"Running\\\", \\\"description\\\": \\\"Defines the policy for cleaning up pods after the TFJob completes.\\\", \\\"name\\\": \\\"CleanPodPolicy\\\", \\\"type\\\": \\\"String\\\"}, {\\\"default\\\": \\\"{}\\\", \\\"description\\\": \\\"TFJob ps replicaSpecs.\\\", \\\"name\\\": \\\"PS Spec\\\", \\\"type\\\": \\\"JsonObject\\\"}, {\\\"default\\\": \\\"{}\\\", \\\"description\\\": \\\"TFJob worker replicaSpecs.\\\", \\\"name\\\": \\\"Worker Spec\\\", \\\"type\\\": \\\"JsonObject\\\"}, {\\\"default\\\": \\\"{}\\\", \\\"description\\\": \\\"TFJob chief replicaSpecs.\\\", \\\"name\\\": \\\"Chief Spec\\\", \\\"type\\\": \\\"JsonObject\\\"}, {\\\"default\\\": \\\"{}\\\", \\\"description\\\": \\\"TFJob evaluator replicaSpecs.\\\", \\\"name\\\": \\\"Evaluator Spec\\\", \\\"type\\\": \\\"JsonObject\\\"}, {\\\"default\\\": 1440, \\\"description\\\": \\\"Time in minutes to wait for the TFJob to complete.\\\", \\\"name\\\": \\\"Tfjob Timeout Minutes\\\", \\\"type\\\": \\\"Integer\\\"}, {\\\"default\\\": \\\"True\\\", \\\"description\\\": \\\"Whether to delete the tfjob after it is finished.\\\", \\\"name\\\": \\\"Delete Finished Tfjob\\\", \\\"type\\\": \\\"Bool\\\"}], \\\"name\\\": \\\"Kubeflow - Launch TFJob\\\"}\",\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"nikenano/launchernew:latest\",\"command\":[\"python\",\"/ml/launch_tfjob.py\"],\"args\":[\"--name\",\"mnist-e2e-2\",\"--namespace\",\"profile-ref10\",\"--version\",\"v1\",\"--activeDeadlineSeconds\",\"-1\",\"--backoffLimit\",\"-1\",\"--cleanPodPolicy\",\"Running\",\"--ttlSecondsAfterFinished\",\"-1\",\"--psSpec\",\"{}\",\"--workerSpec\",\"{\\\"replicas\\\": 1, \\\"restartPolicy\\\": \\\"OnFailure\\\", \\\"template\\\": {\\\"metadata\\\": {\\\"annotations\\\": {\\\"sidecar.istio.io/inject\\\": \\\"false\\\"}}, \\\"spec\\\": {\\\"containers\\\": [{\\\"name\\\": \\\"tensorflow\\\", \\\"image\\\": \\\"docker.io/liuhougangxa/tf-estimator-mnist\\\", \\\"command\\\": [\\\"sh\\\", \\\"-c\\\"], \\\"args\\\": [\\\"python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88\\\"]}]}}}\",\"--chiefSpec\",\"{\\\"replicas\\\": 1, \\\"restartPolicy\\\": \\\"OnFailure\\\", \\\"template\\\": {\\\"metadata\\\": {\\\"annotations\\\": {\\\"sidecar.istio.io/inject\\\": \\\"false\\\"}}, \\\"spec\\\": {\\\"containers\\\": [{\\\"name\\\": \\\"tensorflow\\\", \\\"image\\\": \\\"docker.io/liuhougangxa/tf-estimator-mnist\\\", \\\"command\\\": [\\\"sh\\\", \\\"-c\\\"], \\\"args\\\": [\\\"python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.04953975758073499 --tf-batch-size=88\\\"], \\\"volumeMounts\\\": [{\\\"mountPath\\\": \\\"/mnt/export\\\", \\\"name\\\": \\\"model-volume\\\"}]}], \\\"volumes\\\": [{\\\"name\\\": \\\"model-volume\\\", \\\"persistentVolumeClaim\\\": {\\\"claimName\\\": \\\"end-to-end-pipeline-6v2q4-model-volume\\\"}}]}}}\",\"--evaluatorSpec\",\"{}\",\"--tfjobTimeoutMinutes\",\"60\",\"--deleteAfterDone\",\"True\"],\"resources\":{}},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"jamxl-kfp-dev-kfp\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/end-to-end-pipeline-lk2q6/end-to-end-pipeline-lk2q6-362872075\"}}}"
time="2021-05-19T16:46:18.438Z" level=info msg="Waiting on main container"
time="2021-05-19T16:46:20.149Z" level=info msg="main container started with container ID: ff7e0163d8e7936a5140fb765480e7a33b99e3e4491102c9579daf6b8e685290"
time="2021-05-19T16:46:20.149Z" level=info msg="Starting annotations monitor"
time="2021-05-19T16:46:20.156Z" level=info msg="docker wait ff7e0163d8e7936a5140fb765480e7a33b99e3e4491102c9579daf6b8e685290"
time="2021-05-19T16:46:20.156Z" level=info msg="Starting deadline monitor"
time="2021-05-19T16:46:30.156Z" level=info msg="/argo/podmetadata/annotations updated"
time="2021-05-19T16:51:18.438Z" level=info msg="Alloc=5003 TotalAlloc=15663 Sys=70848 NumGC=7 Goroutines=12"
time="2021-05-19T16:56:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15675 Sys=70848 NumGC=9 Goroutines=12"
time="2021-05-19T17:01:18.438Z" level=info msg="Alloc=5003 TotalAlloc=15688 Sys=70848 NumGC=12 Goroutines=12"
time="2021-05-19T17:06:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15701 Sys=70848 NumGC=14 Goroutines=12"
time="2021-05-19T17:11:18.438Z" level=info msg="Alloc=5003 TotalAlloc=15713 Sys=70848 NumGC=17 Goroutines=12"
time="2021-05-19T17:16:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15726 Sys=70848 NumGC=19 Goroutines=12"
time="2021-05-19T17:21:18.438Z" level=info msg="Alloc=5004 TotalAlloc=15739 Sys=70848 NumGC=22 Goroutines=12"
time="2021-05-19T17:26:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15751 Sys=70848 NumGC=24 Goroutines=12"
time="2021-05-19T17:31:18.438Z" level=info msg="Alloc=5004 TotalAlloc=15763 Sys=70848 NumGC=27 Goroutines=12"
time="2021-05-19T17:36:18.438Z" level=info msg="Alloc=5005 TotalAlloc=15776 Sys=70848 NumGC=29 Goroutines=12"
time="2021-05-19T17:41:18.438Z" level=info msg="Alloc=4855 TotalAlloc=15789 Sys=70848 NumGC=32 Goroutines=10"
time="2021-05-19T17:45:54.143Z" level=info msg="Main container completed"
time="2021-05-19T17:45:54.144Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-05-19T17:45:54.144Z" level=info msg="Capturing script exit code"
time="2021-05-19T17:45:54.144Z" level=info msg="Annotations monitor stopped"
time="2021-05-19T17:45:54.185Z" level=info msg="Saving logs"
time="2021-05-19T17:45:54.185Z" level=info msg="[docker logs ff7e0163d8e7936a5140fb765480e7a33b99e3e4491102c9579daf6b8e685290]"
time="2021-05-19T17:45:54.217Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/end-to-end-pipeline-lk2q6/end-to-end-pipeline-lk2q6-362872075/main.log"
time="2021-05-19T17:45:54.218Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2021-05-19T17:45:54.218Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: jamxl-kfp-dev-kfp, key: artifacts/end-to-end-pipeline-lk2q6/end-to-end-pipeline-lk2q6-362872075/main.log)"
time="2021-05-19T17:45:54.467Z" level=info msg="No output parameters"
time="2021-05-19T17:45:54.467Z" level=info msg="No output artifacts"
time="2021-05-19T17:45:54.467Z" level=info msg="Annotating pod with output"
time="2021-05-19T17:45:54.490Z" level=info msg="Killing sidecars"
time="2021-05-19T17:45:54.495Z" level=info msg="Alloc=6429 TotalAlloc=17373 Sys=70848 NumGC=34 Goroutines=10"

main container: Same as Exception: Timeout waiting for kubeflow.org/tfjobs mnist-e2e-2 in namespace profile-ref10 to enter one of the conditions ['Succeeded', 'Failed'].

andreyvelich commented 3 years ago

Please can you also describe the TFJob pods, not a Kubeflow Pipeline task? kubectl describe pod -n profile-ref10 mnist-e2e-ui-chief-0 And TFJob: kubectl describe tfjob mnist-e2e -n profile-ref10

zijianjoy commented 3 years ago

I see, thanks for clarification!

Sharing the TFJob pod: kubectl describe pod -n profile-ref10 mnist-e2e-ui-chief-0

Name:           mnist-e2e-ui-chief-0
Namespace:      profile-ref10
Priority:       0
Node:           gke-kf-ref10-default-pool-50caf419-s900/10.138.15.208
Start Time:     Wed, 19 May 2021 18:10:40 +0000
Labels:         group-name=kubeflow.org
                job-name=mnist-e2e-ui
                job-role=master
                replica-index=0
                replica-type=chief
Annotations:    sidecar.istio.io/inject: false
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  TFJob/mnist-e2e-ui
Containers:
  tensorflow:
    Container ID:  
    Image:         docker.io/liuhougangxa/tf-estimator-mnist
    Image ID:      
    Port:          2222/TCP
    Host Port:     0/TCP
    Command:
      sh
      -c
    Args:
      python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.03830876327649775 --tf-batch-size=96
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      TF_CONFIG:  {"cluster":{"chief":["mnist-e2e-ui-chief-0.profile-ref10.svc:2222"],"worker":["mnist-e2e-ui-worker-0.profile-ref10.svc:2222"]},"task":{"type":"chief","index":0},"environment":"cloud"}
    Mounts:
      /mnt/export from model-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nbcbj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  model-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  end-to-end-pipeline-6v2q4-model-volume
    ReadOnly:   false
  default-token-nbcbj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nbcbj
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  6m4s (x58 over 171m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[model-volume], unattached volumes=[model-volume default-token-nbcbj]: timed out waiting for the condition
  Warning  FailedMount  93s (x18 over 155m)   kubelet  Unable to attach or mount volumes: unmounted volumes=[model-volume], unattached volumes=[default-token-nbcbj model-volume]: timed out waiting for the condition

And TFJob: kubectl describe tfjob mnist-e2e-ui -n profile-ref10

Name:         mnist-e2e-ui
Namespace:    profile-ref10
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2021-05-19T18:10:40Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:tfReplicaSpecs:
          .:
          f:Chief:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:metadata:
                .:
                f:annotations:
                  .:
                  f:sidecar.istio.io/inject:
              f:spec:
                .:
                f:volumes:
          f:Worker:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:metadata:
                .:
                f:annotations:
                  .:
                  f:sidecar.istio.io/inject:
              f:spec:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2021-05-19T18:10:40Z
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:runPolicy:
          .:
          f:cleanPodPolicy:
        f:successPolicy:
        f:tfReplicaSpecs:
          f:Chief:
            f:template:
              f:metadata:
                f:creationTimestamp:
              f:spec:
                f:containers:
          f:Worker:
            f:template:
              f:metadata:
                f:creationTimestamp:
              f:spec:
                f:containers:
      f:status:
        .:
        f:conditions:
        f:replicaStatuses:
          .:
          f:Chief:
          f:Worker:
            .:
            f:active:
        f:startTime:
    Manager:         tf-operator.v1
    Operation:       Update
    Time:            2021-05-19T18:10:43Z
  Resource Version:  8688633
  Self Link:         /apis/kubeflow.org/v1/namespaces/profile-ref10/tfjobs/mnist-e2e-ui
  UID:               8be5d7c4-7012-49cc-8565-5327fafd67e4
Spec:
  Clean Pod Policy:  Running
  Tf Replica Specs:
    Chief:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Metadata:
          Annotations:
            sidecar.istio.io/inject:  false
        Spec:
          Containers:
            Args:
              python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.03830876327649775 --tf-batch-size=96
            Command:
              sh
              -c
            Image:  docker.io/liuhougangxa/tf-estimator-mnist
            Name:   tensorflow
            Volume Mounts:
              Mount Path:  /mnt/export
              Name:        model-volume
          Volumes:
            Name:  model-volume
            Persistent Volume Claim:
              Claim Name:  end-to-end-pipeline-6v2q4-model-volume
    Worker:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Metadata:
          Annotations:
            sidecar.istio.io/inject:  false
        Spec:
          Containers:
            Args:
              python /opt/model.py --tf-export-dir=/mnt/export --tf-train-steps=200 --tf-learning-rate=0.03830876327649775 --tf-batch-size=96
            Command:
              sh
              -c
            Image:  docker.io/liuhougangxa/tf-estimator-mnist
            Name:   tensorflow
Status:
  Conditions:
    Last Transition Time:  2021-05-19T18:10:40Z
    Last Update Time:      2021-05-19T18:10:40Z
    Message:               TFJob mnist-e2e-ui is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
  Replica Statuses:
    Chief:
    Worker:
      Active:  1
  Start Time:  2021-05-19T18:10:40Z
Events:        <none>
andreyvelich commented 3 years ago

It seems that the problem with attaching volume to your pods. It takes long time. Do you have any specific setup or constraint for your GCP cluster ?

Also, check this: https://github.com/kubernetes/kubernetes/issues/67014#issuecomment-827629299. Which version of GKE are you using ? Can you try to upgrade to 1.19 ?

zijianjoy commented 3 years ago

I am not aware of any specific constraint on my GCP cluster. I am using https://www.kubeflow.org/docs/distributions/gke/deploy/deploy-cli/ to deploy Kubeflow cluster.

I am not sure if upgrading Kubernetes version can help with this issue. Would you like to help me understand? My current Kubernetes version is 1.18.17-gke.100.

What is the Kubernetes version you are using which can finish volume mounting successfully?

andreyvelich commented 3 years ago

I am using GKE cluster with 4 Nodes. Version is: 1.19.9-gke.1900. FYI, to clean-up all resources from this example, run this:

kubectl delete wf -n profile-ref10 --all
kubectl delete tfjob -n profile-ref10 --all
kubectl delete experiment -n profile-ref10 --all
kubectl delete isvc -n profile-ref10 --all
zijianjoy commented 3 years ago

Thank you @andreyvelich , it is useful command to clean up past resources, maybe we can add them to the E2E example documentation.

I would like to share update from my side:

Environment

I had Kubernetes cluster with version1.18.17-gke.100, and machine type is e2-standard-8, and 2 nodes in default node pool. I have disabled autoscaling. And the image type Container-Optimized OS with Docker (cos).

Based on the recommendation, I have upgraded cluster master version and nodepool node version to 1.19.9-gke.1400. Note that in this case, the image type doesn't change.

Action

Instead of calling KFP to create a run directly, I used the following command to compile to yaml file, and upload to KFP from UI. (I set delete_finished_tfjob=False):

from kfp import compiler 
compiler.Compiler(mode=kfp.dsl.pipeline_executionmode_v2).compile(mnist_pipeline, 'mnist_pipeline_not_delete_tfjob.yaml')

With this approach, I am able to finish pipeline run and created inference service successfully!

However, I am not able to build connection to predict endpoint as the final step of E2E example. Error message is

ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'X-Goog-IAP-Generated-Response': 'true', 'Content-Length': '45', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Fri, 21 May 2021 17:17:30 GMT', 'Alt-Svc': 'clear'})
HTTP response body: OpenID Connect token expired: JWT has expired

I guess it is about the IAP policy requirement for sending request to GCP endpoint.

Limitation

Note that Kubeflow Pipelines actually doesn't work on Kubernetes 1.19: https://github.com/kubeflow/pipelines/issues/5714. The reason that pipeline can run in the Action section above is because image type doesn't change by node version upgrade. If we create a Kubernetes cluster starting at 1.19, then image type will be Container-Optimized OS with Containerd. Then KFP will not work in this case. We might need to pin image type in GCP distribution if we upgrade k8s version to 1.19 for this E2E example. cc @Bobgy

Another caveat is that calling KFP to create run directly from notebook still fails with the same message as described at the beginning of this thread https://github.com/kubeflow/gcp-blueprints/issues/271#issuecomment-839200055, even when I have upgraded to k8s 1.19. Below is the command I use to initialize run:

# Run the Kubeflow Pipeline in the user's namespace.
kfp_client=kfp.Client(host='https://kf-ref9.endpoints.jamxl-kfp-dev.cloud.goog/pipeline',
    client_id='xxxxx.apps.googleusercontent.com',
    other_client_id='xxx.apps.googleusercontent.com',
    other_client_secret='xxx')
run_id = kfp_client.create_run_from_pipeline_func(mnist_pipeline, namespace=namespace, arguments={}).run_id
print("Run ID: ", run_id)

What is the difference which caused the pipeline run failure when uploading from UI vs in-cluster notebook?

andreyvelich commented 3 years ago

Thank you for providing this information @zijianjoy.

The reason that pipeline can run in the Action section above is because image type doesn't change by node version upgrade.

Yes, for my GCP cluster I am also using Container-Optimized OS with Docker (cos) image type, since I upgraded GKE from older version.

However, I am not able to build connection to predict endpoint as the final step of E2E example. Error message is

I believe, if you want to call KFServing inference outside of K8s cluster (not from Kubeflow Notebooks) you should use another URL. @yuzisun Do we have an example how to call KFServing inference which is deployed on GCP ?

What is the difference which caused the pipeline run failure when uploading from UI vs in-cluster notebook?

It should not be any differences. Did you give your Kubeflow Notebook access to KFP (see this comment: https://github.com/kubeflow/pipelines/issues/5138#issuecomment-797003710) and extend KFP SDK with token credentials changes from this PR: https://github.com/kubeflow/pipelines/pull/5287 ? Basically, to run Pipelines in Multi-user mode.

Do we have any differences in Volume creation step (dsl.VolumeOp()) if Pipeline is executed from Notebook or from the UI ?