SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

Deploying transformer model from mlflow fails in v2 #4767

Open nadworny opened 1 year ago

nadworny commented 1 year ago

Describe the bug

Deploying a transformer model (huggingface) on v2 using mlflow requirement fails with the following error:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: xxx
  namespace: seldon-mesh
spec:
  requirements:
  - mlflow
  secretName: xxx-secret
  storageUri: azureblob://xxx
status:
  conditions:
  - lastTransitionTime: "2023-03-30T08:08:53Z"
    reason: 'failed to schedule model xxx. [failed server filter SharingServerFilter
      for server replica mlserver : sharing false failed replica filter RequirementsReplicaFilter
      for server replica triton:0 : model requirements [mlflow[] replica capabilities
      [triton dali fil onnx openvino python pytorch tensorflow tensorrt[]]'
    status: "False"
    type: ModelReady
  - lastTransitionTime: "2023-03-30T08:08:53Z"
    reason: 'failed to schedule model xxx. [failed server filter SharingServerFilter
      for server replica mlserver : sharing false failed replica filter RequirementsReplicaFilter
      for server replica triton:0 : model requirements [mlflow[] replica capabilities
      [triton dali fil onnx openvino python pytorch tensorflow tensorrt[]]'
    status: "False"
    type: Ready
  replicas: 1

I'm not sure why it's trying to deploy to triton if I provided mlflow as a requirement.
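The status message itself hints at the answer: the scheduler tries every known server replica and records why each one was rejected, so triton shows up only as a rejected candidate. A simplified sketch of that requirement check (not Seldon's actual code; the capability lists are illustrative, taken from the status above and from a typical mlserver deployment):

```python
# Simplified sketch of the scheduler's RequirementsReplicaFilter: a model
# can only be placed on a replica whose capabilities cover all of the
# model's declared requirements.
triton_capabilities = {"triton", "dali", "fil", "onnx", "openvino",
                       "python", "pytorch", "tensorflow", "tensorrt"}
mlserver_capabilities = {"mlserver", "alibi-detect", "alibi-explain",
                         "huggingface", "lightgbm", "mlflow", "python",
                         "sklearn", "spark-mlib", "xgboost"}
model_requirements = {"mlflow"}

def can_host(requirements, capabilities):
    """True if every model requirement is among the replica's capabilities."""
    return requirements.issubset(capabilities)

print(can_host(model_requirements, triton_capabilities))    # False: triton rejected
print(can_host(model_requirements, mlserver_capabilities))  # True: mlserver would match
```

So triton being rejected is expected; the real question is why mlserver was also rejected (the `sharing false` part of the message).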

Also, I saw an error in the controller manager which might be related. It may stem from the fact that this package comes from a private PyPI. Is there a way to see more detailed logs from when the model is being loaded? I would expect an error saying that pip couldn't install it.

1.6801628338098726e+09    INFO    schedulerClient.SubscribeModelEvents    Received event    {"name": "xxx", "version": 1, "generation": 1, "state": "ModelFailed", "reason": "rpc error: code = Internal desc = builtins.ModuleNotFoundError: No module named 'xxx_data_model'"}
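For context, this ModuleNotFoundError is what Python raises when unpickling an object whose defining module isn't importable in the serving environment, which is consistent with the private package never being installed on the server. A minimal, self-contained illustration (all names are stand-ins):

```python
import pickle
import sys
import types

# Minimal illustration of the failure mode in the log above: unpickling an
# object whose defining module is not importable raises ModuleNotFoundError.
# 'xxx_data_model' is the stand-in package name from this issue.

# Register a throwaway module holding a class, as the private package would.
mod = types.ModuleType("xxx_data_model")

class Model:
    pass

Model.__module__ = "xxx_data_model"
mod.Model = Model
sys.modules["xxx_data_model"] = mod

payload = pickle.dumps(Model())

# Simulate a server image where the private package was never installed.
del sys.modules["xxx_data_model"]

try:
    pickle.loads(payload)
except ModuleNotFoundError as e:
    print(e)  # No module named 'xxx_data_model'
```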

To reproduce

  1. Train a model and log it to mlflow.
  2. Deploy:

    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    metadata:
      name: xxx
    spec:
      storageUri: "azureblob://xxx"
      secretName: "xxx-secret"
      requirements:
      - mlflow

Expected behaviour

Model is deployed on mlserver.

Environment

ukclivecox commented 1 year ago

This looks like the mlserver Server has sharing set to false for some reason. The status gives you the reasons scheduling failed: for triton the requirements don't match, and for mlserver the sharing setting is false.

Can you show the Server resource?
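For reference, the Server custom resource (as opposed to the StatefulSet it owns) can be inspected with something like the following, assuming the default names from a seldon-mesh install:

```shell
# Fetch the Server custom resource (mlops.seldon.io/v1alpha1).
kubectl get server mlserver -n seldon-mesh -o yaml
```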

nadworny commented 1 year ago

Here it is:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: seldon-v2-servers
    meta.helm.sh/release-namespace: seldon-mesh
  creationTimestamp: "2023-03-30T07:45:09Z"
  generation: 1
  labels:
    app: seldon-server
    app.kubernetes.io/managed-by: Helm
  name: mlserver
  namespace: seldon-mesh
  ownerReferences:
  - apiVersion: mlops.seldon.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Server
    name: mlserver
    uid: c62ee0c5-a49b-4757-941f-f24707ccc6db
  resourceVersion: "356890673"
  uid: 2c1c6b4b-e935-4eef-9975-e720b5899ed9
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      seldon-server-name: mlserver
  serviceName: mlserver
  template:
    metadata:
      annotations:
        meta.helm.sh/release-name: seldon-v2-servers
        meta.helm.sh/release-namespace: seldon-mesh
      creationTimestamp: null
      labels:
        app: seldon-server
        app.kubernetes.io/managed-by: Helm
        seldon-server-name: mlserver
      name: mlserver
      namespace: seldon-mesh
    spec:
      containers:
      - image: docker.io/seldonio/seldon-rclone:2.3.0
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            httpGet:
              path: terminate
              port: 9007
              scheme: HTTP
        name: rclone
        ports:
        - containerPort: 5572
          name: rclone
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 5572
          timeoutSeconds: 1
        resources:
          limits:
            memory: 128Mi
          requests:
            cpu: 50m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/agent
          name: mlserver-models
      - args:
        - --tracing-config-path=/mnt/tracing/tracing.json
        command:
        - /bin/agent
        env:
        - name: SELDON_SERVER_CAPABILITIES
          value: mlserver,alibi-detect,alibi-explain,huggingface,lightgbm,mlflow,python,sklearn,spark-mlib,xgboost
        - name: SELDON_MODEL_INFERENCE_LAG_THRESHOLD
          value: "30"
        - name: SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD
          value: "600"
        - name: SELDON_SCALING_STATS_PERIOD_SECONDS
          value: "20"
        - name: SELDON_OVERCOMMIT_PERCENTAGE
          value: "10"
        - name: CONTROL_PLANE_SECURITY_PROTOCOL
          value: PLAINTEXT
        - name: CONTROL_PLANE_CLIENT_TLS_SECRET_NAME
          value: seldon-controlplane-client
        - name: CONTROL_PLANE_SERVER_TLS_SECRET_NAME
          value: seldon-controlplane-server
        - name: CONTROL_PLANE_CLIENT_TLS_KEY_LOCATION
          value: /tmp/certs/cpc/tls.key
        - name: CONTROL_PLANE_CLIENT_TLS_CRT_LOCATION
          value: /tmp/certs/cpc/tls.crt
        - name: CONTROL_PLANE_CLIENT_TLS_CA_LOCATION
          value: /tmp/certs/cpc/ca.crt
        - name: CONTROL_PLANE_SERVER_TLS_CA_LOCATION
          value: /tmp/certs/cps/ca.crt
        - name: ENVOY_SECURITY_PROTOCOL
          value: PLAINTEXT
        - name: ENVOY_UPSTREAM_SERVER_TLS_SECRET_NAME
          value: seldon-upstream-server
        - name: ENVOY_UPSTREAM_CLIENT_TLS_SECRET_NAME
          value: seldon-upstream-client
        - name: ENVOY_UPSTREAM_SERVER_TLS_KEY_LOCATION
          value: /tmp/certs/dus/tls.key
        - name: ENVOY_UPSTREAM_SERVER_TLS_CRT_LOCATION
          value: /tmp/certs/dus/tls.crt
        - name: ENVOY_UPSTREAM_SERVER_TLS_CA_LOCATION
          value: /tmp/certs/dus/ca.crt
        - name: ENVOY_UPSTREAM_CLIENT_TLS_CA_LOCATION
          value: /tmp/certs/duc/ca.crt
        - name: SELDON_SERVER_HTTP_PORT
          value: "9000"
        - name: SELDON_SERVER_GRPC_PORT
          value: "9500"
        - name: SELDON_REVERSE_PROXY_HTTP_PORT
          value: "9001"
        - name: SELDON_REVERSE_PROXY_GRPC_PORT
          value: "9501"
        - name: SELDON_SCHEDULER_HOST
          value: seldon-scheduler
        - name: SELDON_SCHEDULER_PORT
          value: "9005"
        - name: SELDON_SCHEDULER_TLS_PORT
          value: "9055"
        - name: SELDON_METRICS_PORT
          value: "9006"
        - name: SELDON_DRAINER_PORT
          value: "9007"
        - name: AGENT_TLS_SECRET_NAME
        - name: AGENT_TLS_FOLDER_PATH
        - name: SELDON_SERVER_TYPE
          value: mlserver
        - name: SELDON_ENVOY_HOST
          value: seldon-mesh
        - name: SELDON_ENVOY_PORT
          value: "80"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: MEMORY_REQUEST
          valueFrom:
            resourceFieldRef:
              containerName: mlserver
              divisor: "0"
              resource: requests.memory
        image: docker.io/seldonio/seldon-agent:2.3.0
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            httpGet:
              path: terminate
              port: 9007
              scheme: HTTP
        name: agent
        ports:
        - containerPort: 9501
          name: grpc
          protocol: TCP
        - containerPort: 9001
          name: http
          protocol: TCP
        - containerPort: 9006
          name: metrics
          protocol: TCP
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: 200m
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/agent
          name: mlserver-models
        - mountPath: /mnt/config
          name: config-volume
        - mountPath: /mnt/tracing
          name: tracing-config-volume
      - env:
        - name: MLSERVER_HTTP_PORT
          value: "9000"
        - name: MLSERVER_GRPC_PORT
          value: "9500"
        - name: MLSERVER_MODELS_DIR
          value: /mnt/agent/models
        - name: MLSERVER_MODEL_PARALLEL_WORKERS
          value: "1"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "false"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "1048576000"
        image: docker.io/seldonio/mlserver:1.2.4
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            httpGet:
              path: terminate
              port: 9007
              scheme: HTTP
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /v2/health/live
            port: server-http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: mlserver
        ports:
        - containerPort: 9500
          name: server-grpc
          protocol: TCP
        - containerPort: 9000
          name: server-http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /v2/health/live
            port: server-http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 1Gi
        startupProbe:
          failureThreshold: 10
          httpGet:
            path: /v2/health/live
            port: server-http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/agent
          name: mlserver-models
          readOnly: true
        - mountPath: /mnt/certs
          name: downstream-ca-certs
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: seldon-server
      serviceAccountName: seldon-server
      terminationGracePeriodSeconds: 120
      volumes:
      - name: downstream-ca-certs
        secret:
          defaultMode: 420
          optional: true
          secretName: seldon-downstream-server
      - configMap:
          defaultMode: 420
          name: seldon-agent
        name: config-volume
      - configMap:
          defaultMode: 420
          name: seldon-tracing
        name: tracing-config-volume
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: mlserver-models
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentReplicas: 1
  currentRevision: mlserver-5957fd5f85
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updateRevision: mlserver-5957fd5f85
  updatedReplicas: 1
ukclivecox commented 1 year ago

The module-not-found error suggests there are missing classes on MLServer. Have you checked whether your model configuration differs in any way from the example in the model zoo: https://docs.seldon.io/projects/seldon-core/en/v2/contents/examples/model-zoo.html#mlflow-wine-model?

Also, have you tested running this model locally with MLServer?
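A hypothetical local check, assuming the model artifact has been downloaded to ./model (the path is a placeholder) and that MLServer with the MLflow runtime is installed:

```shell
# Install MLServer plus its MLflow runtime, then serve the downloaded
# artifact locally; dependency errors surface directly in the console.
pip install mlserver mlserver-mlflow
mlserver start ./model
```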

nadworny commented 1 year ago

Yes, the model runs fine locally when all the dependencies are installed correctly. As mentioned in the issue, I assume the missing package couldn't be installed because it comes from our internal PyPI. What I would expect, though, is some kind of error from seldon during deployment or installation, which I don't see anywhere.
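One possible way around a private-PyPI dependency (an assumption on my part, not something the Seldon docs prescribe for this case) is to bundle the module's source with the MLflow artifact via code_paths, so the serving side doesn't need to pip-install it at all. A sketch, where 'src/xxx_data_model' and the model class are hypothetical stand-ins:

```python
import mlflow
from mlflow.pyfunc import PythonModel

class Passthrough(PythonModel):
    """Trivial stand-in for the real model logged in this issue."""
    def predict(self, context, model_input):
        return model_input

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=Passthrough(),
        # Path to the private package's source, bundled with the artifact
        # so 'import xxx_data_model' works without the internal PyPI:
        code_paths=["src/xxx_data_model"],
    )
```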

Also, that doesn't explain why it's trying to deploy to triton.