nadworny opened 1 year ago
This looks like the mlserver Server has sharing set to false for some reason. The status gives you the reasons scheduling failed: for triton the requirements don't match, and for mlserver the sharing setting is false.
Can you show the Server resource?
Here it is:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: seldon-v2-servers
    meta.helm.sh/release-namespace: seldon-mesh
  creationTimestamp: "2023-03-30T07:45:09Z"
  generation: 1
  labels:
    app: seldon-server
    app.kubernetes.io/managed-by: Helm
  name: mlserver
  namespace: seldon-mesh
  ownerReferences:
  - apiVersion: mlops.seldon.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Server
    name: mlserver
    uid: c62ee0c5-a49b-4757-941f-f24707ccc6db
  resourceVersion: "356890673"
  uid: 2c1c6b4b-e935-4eef-9975-e720b5899ed9
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      seldon-server-name: mlserver
  serviceName: mlserver
  template:
    metadata:
      annotations:
        meta.helm.sh/release-name: seldon-v2-servers
        meta.helm.sh/release-namespace: seldon-mesh
      creationTimestamp: null
      labels:
        app: seldon-server
        app.kubernetes.io/managed-by: Helm
        seldon-server-name: mlserver
      name: mlserver
      namespace: seldon-mesh
    spec:
      containers:
      - image: docker.io/seldonio/seldon-rclone:2.3.0
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            httpGet:
              path: terminate
              port: 9007
              scheme: HTTP
        name: rclone
        ports:
        - containerPort: 5572
          name: rclone
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          tcpSocket:
            port: 5572
          timeoutSeconds: 1
        resources:
          limits:
            memory: 128Mi
          requests:
            cpu: 50m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/agent
          name: mlserver-models
      - args:
        - --tracing-config-path=/mnt/tracing/tracing.json
        command:
        - /bin/agent
        env:
        - name: SELDON_SERVER_CAPABILITIES
          value: mlserver,alibi-detect,alibi-explain,huggingface,lightgbm,mlflow,python,sklearn,spark-mlib,xgboost
        - name: SELDON_MODEL_INFERENCE_LAG_THRESHOLD
          value: "30"
        - name: SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD
          value: "600"
        - name: SELDON_SCALING_STATS_PERIOD_SECONDS
          value: "20"
        - name: SELDON_OVERCOMMIT_PERCENTAGE
          value: "10"
        - name: CONTROL_PLANE_SECURITY_PROTOCOL
          value: PLAINTEXT
        - name: CONTROL_PLANE_CLIENT_TLS_SECRET_NAME
          value: seldon-controlplane-client
        - name: CONTROL_PLANE_SERVER_TLS_SECRET_NAME
          value: seldon-controlplane-server
        - name: CONTROL_PLANE_CLIENT_TLS_KEY_LOCATION
          value: /tmp/certs/cpc/tls.key
        - name: CONTROL_PLANE_CLIENT_TLS_CRT_LOCATION
          value: /tmp/certs/cpc/tls.crt
        - name: CONTROL_PLANE_CLIENT_TLS_CA_LOCATION
          value: /tmp/certs/cpc/ca.crt
        - name: CONTROL_PLANE_SERVER_TLS_CA_LOCATION
          value: /tmp/certs/cps/ca.crt
        - name: ENVOY_SECURITY_PROTOCOL
          value: PLAINTEXT
        - name: ENVOY_UPSTREAM_SERVER_TLS_SECRET_NAME
          value: seldon-upstream-server
        - name: ENVOY_UPSTREAM_CLIENT_TLS_SECRET_NAME
          value: seldon-upstream-client
        - name: ENVOY_UPSTREAM_SERVER_TLS_KEY_LOCATION
          value: /tmp/certs/dus/tls.key
        - name: ENVOY_UPSTREAM_SERVER_TLS_CRT_LOCATION
          value: /tmp/certs/dus/tls.crt
        - name: ENVOY_UPSTREAM_SERVER_TLS_CA_LOCATION
          value: /tmp/certs/dus/ca.crt
        - name: ENVOY_UPSTREAM_CLIENT_TLS_CA_LOCATION
          value: /tmp/certs/duc/ca.crt
        - name: SELDON_SERVER_HTTP_PORT
          value: "9000"
        - name: SELDON_SERVER_GRPC_PORT
          value: "9500"
        - name: SELDON_REVERSE_PROXY_HTTP_PORT
          value: "9001"
        - name: SELDON_REVERSE_PROXY_GRPC_PORT
          value: "9501"
        - name: SELDON_SCHEDULER_HOST
          value: seldon-scheduler
        - name: SELDON_SCHEDULER_PORT
          value: "9005"
        - name: SELDON_SCHEDULER_TLS_PORT
          value: "9055"
        - name: SELDON_METRICS_PORT
          value: "9006"
        - name: SELDON_DRAINER_PORT
          value: "9007"
        - name: AGENT_TLS_SECRET_NAME
        - name: AGENT_TLS_FOLDER_PATH
        - name: SELDON_SERVER_TYPE
          value: mlserver
        - name: SELDON_ENVOY_HOST
          value: seldon-mesh
        - name: SELDON_ENVOY_PORT
          value: "80"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: MEMORY_REQUEST
          valueFrom:
            resourceFieldRef:
              containerName: mlserver
              divisor: "0"
              resource: requests.memory
        image: docker.io/seldonio/seldon-agent:2.3.0
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            httpGet:
              path: terminate
              port: 9007
              scheme: HTTP
        name: agent
        ports:
        - containerPort: 9501
          name: grpc
          protocol: TCP
        - containerPort: 9001
          name: http
          protocol: TCP
        - containerPort: 9006
          name: metrics
          protocol: TCP
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: 200m
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/agent
          name: mlserver-models
        - mountPath: /mnt/config
          name: config-volume
        - mountPath: /mnt/tracing
          name: tracing-config-volume
      - env:
        - name: MLSERVER_HTTP_PORT
          value: "9000"
        - name: MLSERVER_GRPC_PORT
          value: "9500"
        - name: MLSERVER_MODELS_DIR
          value: /mnt/agent/models
        - name: MLSERVER_MODEL_PARALLEL_WORKERS
          value: "1"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "false"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "1048576000"
        image: docker.io/seldonio/mlserver:1.2.4
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            httpGet:
              path: terminate
              port: 9007
              scheme: HTTP
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /v2/health/live
            port: server-http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: mlserver
        ports:
        - containerPort: 9500
          name: server-grpc
          protocol: TCP
        - containerPort: 9000
          name: server-http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /v2/health/live
            port: server-http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 1Gi
        startupProbe:
          failureThreshold: 10
          httpGet:
            path: /v2/health/live
            port: server-http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mnt/agent
          name: mlserver-models
          readOnly: true
        - mountPath: /mnt/certs
          name: downstream-ca-certs
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: seldon-server
      serviceAccountName: seldon-server
      terminationGracePeriodSeconds: 120
      volumes:
      - name: downstream-ca-certs
        secret:
          defaultMode: 420
          optional: true
          secretName: seldon-downstream-server
      - configMap:
          defaultMode: 420
          name: seldon-agent
        name: config-volume
      - configMap:
          defaultMode: 420
          name: seldon-tracing
        name: tracing-config-volume
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: mlserver-models
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentReplicas: 1
  currentRevision: mlserver-5957fd5f85
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updateRevision: mlserver-5957fd5f85
  updatedReplicas: 1
```
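Note that this is the StatefulSet the operator created, not the Server custom resource itself; the latter can be fetched with something like (resource group taken from the ownerReferences above):

```shell
# Fetch the Server custom resource that owns the StatefulSet
kubectl get servers.mlops.seldon.io mlserver -n seldon-mesh -o yaml
```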
The "no module found" error suggests there are missing classes in MLServer. Have you checked whether your model configuration differs in any way from the example in the model zoo: https://docs.seldon.io/projects/seldon-core/en/v2/contents/examples/model-zoo.html#mlflow-wine-model?
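For comparison, a minimal Model resource pinning the model to MLServer via requirements would look roughly like this (the name and storageUri are placeholders, not values from this issue):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: wine
  namespace: seldon-mesh
spec:
  # Placeholder URI; point at your MLflow model artifact location
  storageUri: "gs://<your-bucket>/mlflow-wine"
  requirements:
  - mlflow
```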
Also, have you tested running this model locally with MLServer?
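A rough sketch of such a local smoke test, assuming the MLflow model artifact is in ./model (versions matched to the mlserver 1.2.4 image in the StatefulSet above; paths and model name are placeholders):

```shell
pip install mlserver==1.2.4 mlserver-mlflow==1.2.4
# Minimal model-settings.json for the MLflow runtime
cat > model-settings.json <<'EOF'
{
  "name": "wine",
  "implementation": "mlserver_mlflow.MLflowRuntime",
  "parameters": { "uri": "./model" }
}
EOF
# Serve the model locally; import errors will show up in the console
mlserver start .
```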
Yes, the model runs fine locally if all the dependencies are installed correctly. I assume, as mentioned in the issue, that the missing package couldn't be installed because it comes from our internal PyPI. What I would expect, though, is some kind of error during deployment or installation from Seldon, and I don't see one anywhere.
Also, that doesn't explain why it's trying to deploy to triton.
Describe the bug
Deploying a transformer model (huggingface) on v2 with the mlflow requirement fails with the following error:
I'm not sure why it's trying to deploy to triton if I provided mlflow as a requirement. Also, I saw that there was an error in the controller manager, which might be related. The above might be due to the fact that this package comes from a private PyPI - is there a way to see more detailed logs from when the model is being loaded? I would expect an error saying that pip couldn't install it.
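On the logging question, a sketch of where load errors tend to surface (container names are taken from the StatefulSet above; the pod names are assumptions based on it):

```shell
# Agent sidecar: model load/unload requests and artifact download results
kubectl logs mlserver-0 -c agent -n seldon-mesh
# MLServer container: runtime import errors and "no module found" tracebacks
kubectl logs mlserver-0 -c mlserver -n seldon-mesh
# Scheduler: why a model was matched (or not) to triton vs mlserver
kubectl logs seldon-scheduler-0 -n seldon-mesh
```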
To reproduce
Expected behaviour
Model is deployed on mlserver.
Environment