SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

Deployment failed with sidecar injection in GKE + Anthos Service Mesh #4450

Closed jinserk closed 1 year ago

jinserk commented 2 years ago

Describe the bug

I'm trying to use a GKE private cluster with the standard config and the Anthos Service Mesh managed profile. However, when I deploy the "Iris" model as a test, the deployment gets stuck calling "storage.googleapis.com":

$ kubectl get all -n test
NAME                                                  READY   STATUS     RESTARTS   AGE
pod/iris-model-default-0-classifier-dfb586df4-ltt29   0/3     Init:1/2   0          30s

NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/iris-model-default              ClusterIP   xxx.xxx.65.194   <none>        8000/TCP,5001/TCP   30s
service/iris-model-default-classifier   ClusterIP   xxx.xxx.79.206   <none>        9000/TCP,9500/TCP   30s

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/iris-model-default-0-classifier   0/1     1            0           31s

NAME                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/iris-model-default-0-classifier-dfb586df4   1         1         0       31s
$ kubectl logs -f -n test pod/iris-model-default-0-classifier-dfb586df4-ltt29 -c classifier-model-initializer
2022/11/19 20:59:34 NOTICE: Config file "/.rclone.conf" not found - using defaults
2022/11/19 20:59:57 ERROR : GCS bucket seldon-models path v1.15.0-dev/sklearn/iris: error reading source root directory: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 20:59:57 ERROR : Attempt 1/3 failed with 1 errors and: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 21:00:17 ERROR : GCS bucket seldon-models path v1.15.0-dev/sklearn/iris: error reading source root directory: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 21:00:17 ERROR : Attempt 2/3 failed with 1 errors and: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused

I used "sidecar injection" with the namespace labeling:

kubectl create namespace test
kubectl label namespace test istio-injection- istio.io/rev=asm-managed --overwrite
kubectl annotate --overwrite namespace test mesh.cloud.google.com/proxy='{"managed":"true"}'

When I don't use "sidecar injection", the deployment succeeds. But in that case I need to inject the proxy manually to get access to the model API. I wonder whether this is the intended behaviour.

To reproduce

  1. create a GKE private standard cluster
  2. enable Anthos Service Mesh (https://cloud.google.com/service-mesh/docs/managed/provision-managed-anthos-service-mesh)
  3. allow the webhook port 4443 in the firewall rule for the private master
  4. install Seldon Core with Helm, setting istio.enabled=true
  5. create a test namespace
  6. label + annotate
    kubectl label namespace test istio-injection- istio.io/rev=asm-managed --overwrite
    kubectl annotate --overwrite namespace test mesh.cloud.google.com/proxy='{"managed":"true"}'
  7. deploy iris model
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: iris-model
    spec:
      name: iris
      protocol: v2
      predictors:
      - graph:
          implementation: SKLEARN_SERVER
          modelUri: gs://seldon-models/v1.15.0-dev/sklearn/iris
          name: classifier
        name: default
        replicas: 1

Expected behaviour

Deployment should be successful regardless of sidecar injection

Environment

Model Details

Refer to the above description

jinserk commented 2 years ago

I found this issue (https://github.com/StatCan/daaas/issues/798), where they have a similar problem.

jinserk commented 2 years ago

While the deployment is in progress, the istio-proxy container is in the PodInitializing state, which means the Istio layer is not yet active in the pod, so the egress settings cannot take effect. Is it possible for the model initializer to wait until the istio-proxy is running?

jinserk commented 2 years ago

Is it related to:

? I've tried this

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-asm-managed
  namespace: istio-system
data:
  mesh: |-
    defaultConfig:
      holdApplicationUntilProxyStarts: true

but it didn't change anything.

rinnadom commented 2 years ago

I'm seeing a similar issue on OpenShift (on-prem, using Red Hat OpenShift Service Mesh). The example sklearn SeldonDeployment deploys just fine without sidecar injection. With sidecar injection, the classifier-model-initializer initContainer hangs indefinitely. It seems to be failing to reach the model location.

I also tried adding the holdApplicationUntilProxyStarts annotation, and didn't have any luck with it either.
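
For reference, a minimal sketch of the per-pod form of that setting, using Istio's `proxy.istio.io/config` annotation on the pod template metadata. As far as I understand it, this setting only delays the pod's regular containers until the sidecar is ready. Init containers such as classifier-model-initializer run to completion before the istio-proxy container starts at all, which would be consistent with it having no effect here.

```yaml
# Pod template metadata fragment (sketch only).
# holdApplicationUntilProxyStarts orders the regular containers behind the
# sidecar; it does not change when init containers run.
metadata:
  annotations:
    proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
```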

RafalSkolasinski commented 2 years ago

There is not much we can do there. The issue is with Istio sidecars: they effectively block outgoing traffic from init containers, and we use an init container to fetch the model before the actual inference microservice starts.

I remember the Istio documentation describing some workarounds, basically annotating the deployment to allow the outgoing traffic, but I don't remember exactly where I saw them.
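
One such workaround is probably Istio's `traffic.sidecar.istio.io/excludeOutboundIPRanges` pod annotation, which exempts the listed CIDR ranges from redirection to the sidecar. Below is a rough, untested sketch applied to the iris example from this issue. It rests on two assumptions: that annotations under `componentSpecs[].metadata` are propagated to the pod template, and that `199.36.153.8/30` (the private.googleapis.com range that contains the address in the error log above) is the right range to exclude for this cluster.

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
spec:
  name: iris
  protocol: v2
  predictors:
  - name: default
    replicas: 1
    componentSpecs:
    - metadata:
        annotations:
          # Assumption: traffic to this range bypasses the sidecar, so the
          # model initializer can reach storage.googleapis.com before
          # istio-proxy is running. Adjust the CIDR for your environment.
          traffic.sidecar.istio.io/excludeOutboundIPRanges: "199.36.153.8/30"
      spec:
        containers:
        # Must match the graph node name so the settings apply to that pod.
        - name: classifier
    graph:
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/v1.15.0-dev/sklearn/iris
      name: classifier
```

Note that excluded traffic no longer passes through the mesh, so mesh egress policies and telemetry will not apply to it.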

This will be solved in Core V2 as we do not make use of init containers there.

ukclivecox commented 1 year ago

Closing as answered.