Deployment failed with sidecar injection in GKE + Anthos Service Mesh

jinserk commented 2 years ago

Describe the bug

I'm trying to use a GKE private cluster with the standard configuration and the Anthos Service Mesh managed profile. However, when I deploy the "Iris" model as a test, the deployment gets stuck calling "storage.googleapis.com":

$ kubectl get all -n test
NAME                                                  READY   STATUS     RESTARTS   AGE
pod/iris-model-default-0-classifier-dfb586df4-ltt29   0/3     Init:1/2   0          30s

NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/iris-model-default              ClusterIP   xxx.xxx.65.194   <none>        8000/TCP,5001/TCP   30s
service/iris-model-default-classifier   ClusterIP   xxx.xxx.79.206   <none>        9000/TCP,9500/TCP   30s

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/iris-model-default-0-classifier   0/1     1            0           31s

NAME                                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/iris-model-default-0-classifier-dfb586df4   1         1         0       31s

$ kubectl logs -f -n test pod/iris-model-default-0-classifier-dfb586df4-ltt29 -c classifier-model-initializer
2022/11/19 20:59:34 NOTICE: Config file "/.rclone.conf" not found - using defaults
2022/11/19 20:59:57 ERROR : GCS bucket seldon-models path v1.15.0-dev/sklearn/iris: error reading source root directory: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 20:59:57 ERROR : Attempt 1/3 failed with 1 errors and: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 21:00:17 ERROR : GCS bucket seldon-models path v1.15.0-dev/sklearn/iris: error reading source root directory: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 21:00:17 ERROR : Attempt 2/3 failed with 1 errors and: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused

I used "sidecar injection" with the namespace labeling:

kubectl create namespace test
kubectl label namespace test istio-injection- istio.io/rev=asm-managed --overwrite
kubectl annotate --overwrite namespace test mesh.cloud.google.com/proxy='{"managed":"true"}'

When I don't use "sidecar injection", the deployment succeeds. But in that case I need to inject the proxy manually to get access to the model API. I wonder whether this is the intended behavior.

To reproduce

create a GKE private standard cluster
enable Anthos Service Mesh (https://cloud.google.com/service-mesh/docs/managed/provision-managed-anthos-service-mesh)
allow the webhook port 4443 in the private master firewall rule
install seldon with helm + istio.enabled=true
create a test namespace

label + annotate

kubectl label namespace test istio-injection- istio.io/rev=asm-managed --overwrite
kubectl annotate --overwrite namespace test mesh.cloud.google.com/proxy='{"managed":"true"}'

deploy iris model

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
spec:
  name: iris
  protocol: v2
  predictors:
  - graph:
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/v1.15.0-dev/sklearn/iris
      name: classifier
    name: default
    replicas: 1

Expected behaviour

Deployment should be successful regardless of sidecar injection

Environment

Cloud Provider: GKE

Kubernetes Cluster Version [Output of kubectl version]

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"archive", BuildDate:"2022-11-10T22:18:49Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.12-gke.100", GitCommit:"fd16f64da04c784b85909576a3f7abf4ed49b949", GitTreeState:"clean", BuildDate:"2022-09-22T09:23:33Z", GoVersion:"go1.17.13b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.25) and server (1.23) exceeds the supported minor version skew of +/-1

Deployed Seldon System Images: [Output of kubectl get --namespace seldon-system deploy seldon-controller-manager -o yaml | grep seldonio]

$ kubectl get --namespace seldon-system deploy seldon-controller-manager -o yaml  | grep seldonio
      value: docker.io/seldonio/seldon-core-executor:1.14.1
    image: docker.io/seldonio/seldon-core-operator:1.14.1

Model Details

Refer to the above description

jinserk commented 2 years ago

I found this issue (https://github.com/StatCan/daaas/issues/798), where they ran into a similar problem.

jinserk commented 2 years ago

While the deployment is in progress, the istio-proxy container is in the PodInitializing status, which means the Istio layer is not yet active in the pod, so the egress settings cannot work. Is it possible for the model initializer to wait until the istio-proxy is active?

jinserk commented 2 years ago

Is it related to:

? I've tried this

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-asm-managed
  namespace: istio-system
data:
  mesh: |-
    defaultConfig:
      holdApplicationUntilProxyStarts: true

but it didn't change anything.

rinnadom commented 2 years ago

I'm seeing a similar issue on OpenShift (on-prem, using Red Hat OpenShift Service Mesh). The example sklearn SeldonDeployment deploys just fine without sidecar injection. With sidecar injection, the classifier-model-initializer initContainer hangs indefinitely. It seems to be failing to reach the model location.

I also tried adding the holdApplicationUntilProxyStarts annotation, and didn't have any luck with it either.
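
For reference, a sketch of the pod-level form of this setting as the upstream Istio documentation describes it; whether OpenShift Service Mesh or managed Anthos Service Mesh honors it in the same way is an assumption. The setting only delays the pod's regular containers until the sidecar is ready. Init containers such as classifier-model-initializer run before the sidecar starts, which would be consistent with the annotation having no effect here.

```yaml
# Pod template metadata fragment, not a complete manifest.
metadata:
  annotations:
    # Upstream Istio per-pod ProxyConfig override.
    # Holds regular containers until istio-proxy is ready;
    # it does not change when init containers run.
    proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
```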

RafalSkolasinski commented 2 years ago

There is not much we can do there. The issue is with Istio and sidecars: they block outgoing traffic from init containers, and we use an init container to fetch the model before the actual inference microservice starts.

I remember some solutions in the Istio documentation that annotate deployments to allow the outgoing traffic, but I don't remember exactly where I saw them.
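
One option of this kind in the upstream Istio documentation is a pod annotation that excludes destinations from sidecar redirection. The fragment below is a minimal sketch, not a verified fix: it assumes the range shown (derived from the 199.36.153.8 address in the initializer log above) is the only destination the initializer needs, and that managed Anthos Service Mesh honors the annotation. It may not be the solution remembered here.

```yaml
# Pod template metadata fragment, not a complete manifest.
metadata:
  annotations:
    # Traffic to this CIDR bypasses the sidecar's traffic redirection,
    # so an init container can reach it before istio-proxy is running.
    # Assumption: 199.36.153.8/30 covers the address dialed in the log.
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "199.36.153.8/30"
```

The annotation has to end up on the pod template generated for the SeldonDeployment; where to set it in the SeldonDeployment spec is not confirmed in this thread.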

This will be solved in Core V2 as we do not make use of init containers there.

ukclivecox commented 1 year ago

Closing as answered.
