SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io" #3201

Closed shudhanshh12 closed 2 years ago

shudhanshh12 commented 3 years ago

Error from server (InternalError): error when creating "deployment.yaml": Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.seldon-system.svc:443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": context deadline exceeded

Describe the bug

Creating a new SeldonDeployment fails with the webhook error shown above.

To reproduce

istioctl install --set profile=default -y

kubectl get pods -n istio-system

NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-5cb85cb9fc-nwb9d   1/1     Running   0          12h
istiod-68f469d854-jm7m2                 1/1     Running   0          13h

I deployed the Seldon operator with Helm:

helm install seldon-core seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --set istio.enabled=true --set usageMetrics.enabled=true --namespace seldon-system --set crd.create=true --set certManager.enabled=true

kubectl get pods -n seldon-system

NAME                                         READY   STATUS    RESTARTS   AGE
seldon-controller-manager-6dbb9fbd87-4rtct   1/1     Running   0          47m

kubectl create -f deployment.yaml -n seldon-system

Error from server (InternalError): error when creating "deployment.yaml": Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.seldon-system.svc:443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": dial tcp 10.80.0.14:4443: i/o timeout
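
Both error variants describe the same call: the API server posts to seldon-webhook-service on port 443 and ends up dialing a pod address on port 4443, the port set by --webhook-port in the manager args below. A minimal sketch for pulling that address out of the message so it can be compared with the Service endpoints; `webhook_target` is a hypothetical helper name, not part of kubectl or Seldon:

```shell
#!/bin/sh
# Hypothetical helper: extract the "IP:port" the API server tried to dial
# from a webhook error message. Prints nothing if the message has no
# "dial tcp" part (for example the "context deadline exceeded" variant).
webhook_target() {
  printf '%s\n' "$1" | sed -n 's/.*dial tcp \([0-9.]*:[0-9]*\).*/\1/p'
}

# Tail of the error message quoted in this report.
err='Post "https://seldon-webhook-service.seldon-system.svc:443/...": dial tcp 10.80.0.14:4443: i/o timeout'
target=$(webhook_target "$err")
echo "API server dialed: $target"

# On a live cluster, compare that address with what backs the Service:
#   kubectl get endpoints seldon-webhook-service -n seldon-system
#   kubectl get pods -n seldon-system -o wide
```

If the address matches the manager pod and the pod is Running, the timeout suggests something between the control plane and the pod is dropping the traffic rather than a missing backend.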

Expected behaviour

The SeldonDeployment should be created.

Environment

GKE with Istio installed manually

kubectl get --namespace seldon-system deploy seldon-controller-manager -o yaml | grep seldonio

      value: docker.io/seldonio/engine:1.8.0-dev
      value: docker.io/seldonio/seldon-core-executor:1.8.0-dev
    image: docker.io/seldonio/seldon-core-operator:1.8.0-dev

Model Details

kubectl get deploy -n seldon-system seldon-controller-manager -o yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    meta.helm.sh/release-name: seldon-core
    meta.helm.sh/release-namespace: seldon-system
  creationTimestamp: "2021-05-17T10:48:40Z"
  generation: 1
  labels:
    app: seldon
    app.kubernetes.io/instance: seldon-core
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: seldon-core-operator
    app.kubernetes.io/version: 1.8.0-dev
    control-plane: seldon-controller-manager
  name: seldon-controller-manager
  namespace: seldon-system
  resourceVersion: "4757781"
  selfLink: /apis/apps/v1/namespaces/seldon-system/deployments/seldon-controller-manager
  uid: 59507518-9473-467b-8ac8-6c13db62912c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: seldon
      app.kubernetes.io/instance: seldon1
      app.kubernetes.io/name: seldon
      app.kubernetes.io/version: v0.5
      control-plane: seldon-controller-manager
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        sidecar.istio.io/inject: "false"
      creationTimestamp: null
      labels:
        app: seldon
        app.kubernetes.io/instance: seldon1
        app.kubernetes.io/name: seldon
        app.kubernetes.io/version: v0.5
        control-plane: seldon-controller-manager
    spec:
      containers:
      - args:
        - --enable-leader-election
        - --webhook-port=4443
        - --create-resources=$(MANAGER_CREATE_RESOURCES)
        - --log-level=$(MANAGER_LOG_LEVEL)
        - ""
        command:
        - /manager
        env:
        - name: MANAGER_LOG_LEVEL
          value: INFO
        - name: WATCH_NAMESPACE
        - name: RELATED_IMAGE_EXECUTOR
        - name: RELATED_IMAGE_ENGINE
        - name: RELATED_IMAGE_STORAGE_INITIALIZER
        - name: RELATED_IMAGE_SKLEARNSERVER
        - name: RELATED_IMAGE_XGBOOSTSERVER
        - name: RELATED_IMAGE_MLFLOWSERVER
        - name: RELATED_IMAGE_TFPROXY
        - name: RELATED_IMAGE_TENSORFLOW
        - name: RELATED_IMAGE_EXPLAINER
        - name: RELATED_IMAGE_MOCK_CLASSIFIER
        - name: MANAGER_CREATE_RESOURCES
          value: "false"
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: CONTROLLER_ID
        - name: AMBASSADOR_ENABLED
          value: "false"
        - name: AMBASSADOR_SINGLE_NAMESPACE
          value: "false"
        - name: ENGINE_CONTAINER_IMAGE_AND_VERSION
          value: docker.io/seldonio/engine:1.8.0-dev
        - name: ENGINE_CONTAINER_IMAGE_PULL_POLICY
          value: IfNotPresent
        - name: ENGINE_CONTAINER_SERVICE_ACCOUNT_NAME
          value: default
        - name: ENGINE_CONTAINER_USER
          value: "8888"
        - name: ENGINE_LOG_MESSAGES_EXTERNALLY
          value: "false"
        - name: PREDICTIVE_UNIT_HTTP_SERVICE_PORT
          value: "9000"
        - name: PREDICTIVE_UNIT_GRPC_SERVICE_PORT
          value: "9500"
        - name: PREDICTIVE_UNIT_DEFAULT_ENV_SECRET_REF_NAME
        - name: PREDICTIVE_UNIT_METRICS_PORT_NAME
          value: metrics
        - name: ENGINE_SERVER_GRPC_PORT
          value: "5001"
        - name: ENGINE_SERVER_PORT
          value: "8000"
        - name: ENGINE_PROMETHEUS_PATH
          value: /prometheus
        - name: ISTIO_ENABLED
          value: "true"
        - name: KEDA_ENABLED
          value: "false"
        - name: ISTIO_GATEWAY
          value: istio-system/seldon-gateway
        - name: ISTIO_TLS_MODE
        - name: USE_EXECUTOR
          value: "true"
        - name: EXECUTOR_CONTAINER_IMAGE_AND_VERSION
          value: docker.io/seldonio/seldon-core-executor:1.8.0-dev
        - name: EXECUTOR_CONTAINER_IMAGE_PULL_POLICY
          value: IfNotPresent
        - name: EXECUTOR_PROMETHEUS_PATH
          value: /prometheus
        - name: EXECUTOR_SERVER_PORT
          value: "8000"
        - name: EXECUTOR_CONTAINER_USER
          value: "8888"
        - name: EXECUTOR_CONTAINER_SERVICE_ACCOUNT_NAME
          value: default
        - name: EXECUTOR_SERVER_METRICS_PORT_NAME
          value: metrics
        - name: EXECUTOR_REQUEST_LOGGER_DEFAULT_ENDPOINT
          value: http://default-broker
        - name: DEFAULT_USER_ID
          value: "8888"
        - name: EXECUTOR_DEFAULT_CPU_REQUEST
          value: 500m
        - name: EXECUTOR_DEFAULT_MEMORY_REQUEST
          value: 512Mi
        - name: EXECUTOR_DEFAULT_CPU_LIMIT
          value: 500m
        - name: EXECUTOR_DEFAULT_MEMORY_LIMIT
          value: 512Mi
        - name: ENGINE_DEFAULT_CPU_REQUEST
          value: 500m
        - name: ENGINE_DEFAULT_MEMORY_REQUEST
          value: 512Mi
        - name: ENGINE_DEFAULT_CPU_LIMIT
          value: 500m
        - name: ENGINE_DEFAULT_MEMORY_LIMIT
          value: 512Mi
        image: docker.io/seldonio/seldon-core-operator:1.8.0-dev
        imagePullPolicy: IfNotPresent
        name: manager
        ports:
        - containerPort: 4443
          name: webhook-server
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 200Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp/k8s-webhook-server/serving-certs
          name: cert
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 8888
      serviceAccount: seldon-manager
      serviceAccountName: seldon-manager
      terminationGracePeriodSeconds: 10
      volumes:
      - name: cert
        secret:
          defaultMode: 420
          secretName: seldon-webhook-server-cert
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-05-17T10:48:44Z"
    lastUpdateTime: "2021-05-17T10:48:44Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-05-17T10:48:41Z"
    lastUpdateTime: "2021-05-17T10:48:44Z"
    message: ReplicaSet "seldon-controller-manager-6dbb9fbd87" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

kubectl logs -n seldon-system seldon-controller-manager-6dbb9fbd87-4rtct -f

I0517 10:48:45.397034 1 request.go:621] Throttling request took 1.034903882s, request: GET:https://10.124.16.1:443/apis/batch/v1?timeout=32s
{"level":"info","ts":1621248525.9033751,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1621248525.9049253,"logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment","path":"/mutate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
{"level":"info","ts":1621248525.9050074,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
{"level":"info","ts":1621248525.9050608,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment","path":"/validate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
{"level":"info","ts":1621248525.9050915,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
{"level":"info","ts":1621248525.905159,"logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment","path":"/mutate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
{"level":"info","ts":1621248525.9052534,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
{"level":"info","ts":1621248525.9052784,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment","path":"/validate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
{"level":"info","ts":1621248525.9053478,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
{"level":"info","ts":1621248525.9053905,"logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"machinelearning.seldon.io/v1, Kind=SeldonDeployment","path":"/mutate-machinelearning-seldon-io-v1-seldondeployment"}
{"level":"info","ts":1621248525.9054172,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-machinelearning-seldon-io-v1-seldondeployment"}
{"level":"info","ts":1621248525.905451,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"machinelearning.seldon.io/v1, Kind=SeldonDeployment","path":"/validate-machinelearning-seldon-io-v1-seldondeployment"}
{"level":"info","ts":1621248525.9054766,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-machinelearning-seldon-io-v1-seldondeployment"}
{"level":"info","ts":1621248525.9055269,"logger":"setup","msg":"starting manager"}
I0517 10:48:45.905932 1 leaderelection.go:242] attempting to acquire leader lease seldon-system/a33bd623.machinelearning.seldon.io...
{"level":"info","ts":1621248526.006426,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":1621248526.006426,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1621248526.0068915,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1621248526.0070863,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":4443}
{"level":"info","ts":1621248526.0071626,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
I0517 10:49:03.498490 1 leaderelection.go:252] successfully acquired lease seldon-system/a33bd623.machinelearning.seldon.io
{"level":"info","ts":1621248543.4987247,"logger":"controller","msg":"Starting EventSource","reconcilerGroup":"machinelearning.seldon.io","reconcilerKind":"SeldonDeployment","controller":"seldon-controller-manager","source":"kind source: /, Kind="}
{"level":"info","ts":1621248544.299195,"logger":"controller","msg":"Starting EventSource","reconcilerGroup":"machinelearning.seldon.io","reconcilerKind":"SeldonDeployment","controller":"seldon-controller-manager","source":"kind source: /, Kind="}
{"level":"info","ts":1621248544.299298,"logger":"controller","msg":"Starting EventSource","reconcilerGroup":"machinelearning.seldon.io","reconcilerKind":"SeldonDeployment","controller":"seldon-controller-manager","source":"kind source: /, Kind="}
{"level":"info","ts":1621248544.2995644,"logger":"controller","msg":"Starting EventSource","reconcilerGroup":"machinelearning.seldon.io","reconcilerKind":"SeldonDeployment","controller":"seldon-controller-manager","source":"kind source: /, Kind="}
{"level":"info","ts":1621248544.2995968,"logger":"controller","msg":"Starting Controller","reconcilerGroup":"machinelearning.seldon.io","reconcilerKind":"SeldonDeployment","controller":"seldon-controller-manager"}
{"level":"info","ts":1621248544.2996142,"logger":"controller","msg":"Starting workers","reconcilerGroup":"machinelearning.seldon.io","reconcilerKind":"SeldonDeployment","controller":"seldon-controller-manager","worker count":1}

ukclivecox commented 3 years ago

Does your cluster have any particular RBAC? It looks like the network call is being blocked.

Also, have you checked that the manager pod in seldon-system is running OK? It looks that way from the log above, though.

shudhanshh12 commented 3 years ago

Yes, the pod is running fine. This is a fresh setup: I just created the cluster, and no additional network policy or RBAC has been applied.

I created the GKE cluster and then deployed Seldon using Helm.

helm install seldon-core seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --set istio.enabled=true --set usageMetrics.enabled=true --namespace seldon-system --set crd.create=true --set certManager.enabled=true

ukclivecox commented 3 years ago

What type of cluster are you running on?

Can you check that a ValidatingWebhookConfiguration has been created and that the certificates have been created by cert-manager?

Can you maybe try an install without cert-manager to see if that works?

shudhanshh12 commented 3 years ago

  1. It's a Google-managed zonal cluster.

  2. kubectl get ValidatingWebhookConfguration --all-namespaces

    error: the server doesn't have a resource type "ValidatingWebhookConfguration"

  3. kubectl get certificates --all-namespaces

    NAMESPACE           NAME              READY   SECRET                AGE
    cert-manager-test   selfsigned-cert   True    selfsigned-cert-tls   9d

ukclivecox commented 3 years ago

That's strange. You should see something like:

kubectl get validatingwebhookconfiguration
NAME                                                    WEBHOOKS   AGE
istiod-istio-system                                     1          18h
seldon-validating-webhook-configuration-seldon-system   3          18h

Maybe try to uninstall and ensure there are no mutatingwebhookconfiguration or validatingwebhookconfiguration left and reinstall?
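
One way to look for leftovers after an uninstall, as a rough sketch. It assumes any leftover objects carry "seldon" in their names, as the validating configuration listed above does; `seldon_webhooks` is a hypothetical helper, and the sample input stands in for live `kubectl get ... -o name` output:

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get ... -o name` output on stdin and keep
# only the entries whose name mentions seldon. `|| true` keeps the exit status
# zero when nothing matches, which is the desired state before a reinstall.
seldon_webhooks() {
  grep -i 'seldon' || true
}

# Sample input in the `-o name` format (resource.group/name).
sample='validatingwebhookconfiguration.admissionregistration.k8s.io/istiod-istio-system
validatingwebhookconfiguration.admissionregistration.k8s.io/seldon-validating-webhook-configuration-seldon-system'
printf '%s\n' "$sample" | seldon_webhooks

# On a live cluster (both resource types are cluster-scoped):
#   kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o name | seldon_webhooks
#   kubectl delete validatingwebhookconfiguration <name>
```

An empty result after the uninstall means no Seldon webhook configuration is left to interfere with the reinstall.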

shudhanshh12 commented 3 years ago

done

kubectl get validatingwebhookconfiguration 
NAME                                                    WEBHOOKS   AGE
cert-manager-webhook                                    1          1h
istiod-istio-system                                     1          9h
nodelimit.config.common-webhooks.networking.gke.io      1          1h
seldon-validating-webhook-configuration-seldon-system   3          1h
validation-webhook.snapshot.storage.k8s.io              1          7d23h

shudhanshh12 commented 3 years ago

getting the same error:

Error from server (InternalError): error when creating "deployment.yaml": Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.seldon-system.svc:443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": dial tcp 10.80.1.4:4443: i/o timeout

shudhanshh12 commented 3 years ago

@cliveseldon can you please help me to debug this?

shudhanshh12 commented 3 years ago

kubectl -n seldon-system logs seldon-controller-manager-78fb87cd68-grc9h -p

Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/seldon-controller-manager-78fb87cd68-grc9h. Please use `kubectl.kubernetes.io/default-container` instead
{"level":"error","ts":1621374568.822609,"logger":"controller-runtime.manager","msg":"Failed to get API Group-Resources","error":"Get \"https://10.124.16.1:443/api?timeout=32s\": dial tcp 10.124.16.1:443: connect: connection refused","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.4/pkg/manager/manager.go:279\nmain.main\n\t/workspace/main.go:156\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
{"level":"error","ts":1621374568.822711,"logger":"setup","msg":"unable to start manager","error":"Get \"https://10.124.16.1:443/api?timeout=32s\": dial tcp 10.124.16.1:443: connect: connection refused","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\nmain.main\n\t/workspace/main.go:165\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

ukclivecox commented 3 years ago

Can you try the core_istio ansible playbook on GKE found here: https://github.com/SeldonIO/tempo/tree/master/ansible

I tested with this today and everything worked on a GKE 1.19 cluster.

shudhanshh12 commented 3 years ago

No progress. I'm getting the same issue. Could it be related to using a private cluster?

Error from server (InternalError): error when creating "deployment.yaml": Internal error occurred: failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.seldon-system.svc:443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": context deadline exceeded

also when deployed using ansible I'm not able to see

kubectl get pods -n seldon-system
NAME                                        READY   STATUS    RESTARTS   AGE
seldon-controller-manager-cd97b9c85-whdx6   1/1     Running   0          9m25s

shudhanshh12 commented 3 years ago

[Image: Screenshot 2021-05-31 at 4 07 02 AM]

ukclivecox commented 3 years ago

OK. What type of cluster are you running? The above was tested on a standard GKE 1.19 cluster.

apurvamishra20 commented 3 years ago

I am facing the same issue while creating a deployment in Seldon. The manager node is running fine, but creating the deployment fails with the error below:

Internal error occurred: failed calling webhook "v1.vseldondeployment.kb.io": Post https://seldon-webhook-service.fusion.svc:443/validate-machinelearning-seldon-io-v1-seldondeployment?timeout=30s: context deadline exceeded

I am running it on GKE 1.18 cluster. Does it work on 1.19 only?

ukclivecox commented 3 years ago

Could be related to this: https://github.com/knative/serving/issues/4868
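
One known cause of this symptom on private GKE clusters is that the control plane can, by default, reach the nodes only on a small set of ports, so a webhook listening on another port (4443 here, from --webhook-port) times out until a firewall rule allows it. The thread never confirms that this cluster is private or that this resolved the issue, so the following is only a sketch. Every value is a placeholder assumption, and the script prints the gcloud command instead of running it:

```shell
#!/bin/sh
# Placeholder values (assumptions, not taken from this thread).
NETWORK="default"             # VPC network the cluster uses
NODE_TAG="my-gke-node-tag"    # network tag on the cluster's nodes
MASTER_CIDR="172.16.0.0/28"   # control-plane IPv4 range of the private cluster
WEBHOOK_PORT=4443             # matches --webhook-port=4443 in the manager args

# The real control-plane range can be looked up with:
#   gcloud container clusters describe <cluster> --zone <zone> \
#     --format='value(privateClusterConfig.masterIpv4CidrBlock)'

# Print (do not run) a firewall rule that lets the control plane reach the
# webhook port on the nodes. The rule name is hypothetical.
firewall_cmd() {
  echo gcloud compute firewall-rules create allow-master-to-seldon-webhook \
    --network "$NETWORK" --direction INGRESS --action ALLOW \
    --rules "tcp:$WEBHOOK_PORT" --source-ranges "$MASTER_CIDR" \
    --target-tags "$NODE_TAG"
}
firewall_cmd
```

Printing first keeps the sketch safe to run anywhere; the command should only be executed after the placeholders have been replaced with values read from the actual cluster.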

ukclivecox commented 2 years ago

Please reopen if this is still an issue.