SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

GPU support with SERVICE_TYPE Model #590

Closed muma378 closed 5 years ago

muma378 commented 5 years ago

Hi, I was trying to deploy a SeldonDeployment to the cluster that requests GPU resources and CUDA. I wrote the .yaml as the official docs suggest; however, the deployment was blocked at the CRD "parsing" stage, so no Deployment or Service was created. Deploying models without GPU requirements works fine.

I didn't find any example of using GPUs, so my question is: does Seldon Core support GPUs? Has anyone succeeded in deploying a model that requires a GPU?

This is part of my .yaml:

  predictors:
  - annotations:
      predictor_version: v1
    componentSpecs:
    - spec:
        containers:
        - image: xxx
          resources:
            limits:
              alpha.kubernetes.io/nvidia-gpu: 1
          imagePullPolicy: IfNotPresent
          name: xx
          volumeMounts:
          - mountPath: /usr/local/nvidia/bin
            name: bin
          - mountPath: /usr/lib/nvidia
            name: lib
        imagePullSecrets:
        - name: regcred
        terminationGracePeriodSeconds: 1
        volumes:
          - hostPath:
              path: /usr/lib/nvidia-384/bin
            name: bin
          - hostPath:
              path: /usr/lib/nvidia-384
            name: lib
    graph:
      children: []
      endpoint:
        type: GRPC
      name: xx
      type: MODEL
    name: xx
    replicas: 1
ukclivecox commented 5 years ago

Can you provide more details on the error at "parsing" stage? Are you using the "master" branch or an earlier version of Seldon Core?

muma378 commented 5 years ago

> Can you provide more details on the error at "parsing" stage? Are you using the "master" branch or an earlier version of Seldon Core?

No, the versions of seldon-core and seldon-core-crd are both 0.2.5, installed locally with Helm.

The seldon-core-apiserver reports the error below:

2019-05-26 15:41:08.623  INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher    : The time is now 15:41:08
2019-05-26 15:41:08.623  INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher    : Watching with rs 3980232
2019-05-26 15:41:08.682  INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher    : ADDED
 : {"apiVersion":"machinelearning.seldon.io/v1alpha2","kind":"SeldonDeployment","metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",\"kind\":\"SeldonDeployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"seldon\",\"ksonnet.io/component\":\"facedet\"},\"name\":\"facedet-gpu\",\"namespace\":\"modelzoo\"},\"spec\":{\"annotations\":{\"deployment_version\":\"v1\",\"project_name\":\"facedet\",\"seldon.io/grpc-read-timeout\":\"60000\",\"seldon.io/rest-connection-timeout\":\"60000\",\"seldon.io/rest-read-timeout\":\"60000\"},\"name\":\"facedet-gpu\",\"oauth_key\":\"\",\"oauth_secret\":\"\",\"predictors\":[{\"annotations\":{\"predictor_version\":\"v1\"},\"componentSpecs\":[{\"spec\":{\"containers\":[{\"image\":\"facedet-gpu:v0.1\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"facedet\",\"resources\":{\"limits\":{\"alpha.kubernetes.io/nvidia-gpu\":1}},\"volumeMounts\":[{\"mountPath\":\"/usr/local/nvidia/bin\",\"name\":\"bin\"},{\"mountPath\":\"/usr/lib/nvidia\",\"name\":\"lib\"}]}],\"imagePullSecrets\":[{\"name\":\"regcred\"}],\"terminationGracePeriodSeconds\":1,\"volumes\":[{\"hostPath\":{\"path\":\"/usr/lib/nvidia-384/bin\"},\"name\":\"bin\"},{\"hostPath\":{\"path\":\"/usr/lib/nvidia-384\"},\"name\":\"lib\"}]}}],\"graph\":{\"children\":[],\"endpoint\":{\"type\":\"GRPC\"},\"name\":\"facedet\",\"type\":\"MODEL\"},\"name\":\"facedet-gpu\",\"replicas\":1}]}}\n"},"creationTimestamp":"2019-05-26T15:40:22Z","generation":1.0,"labels":{"app":"seldon","ksonnet.io/component":"facedet"},"name":"facedet-gpu","namespace":"modelzoo","resourceVersion":"4317220","selfLink":"/apis/machinelearning.seldon.io/v1alpha2/namespaces/modelzoo/seldondeployments/facedet-gpu","uid":"931f9605-7fcc-11e9-a912-408d5c260149"},"spec":{"annotations":{"deployment_version":"v1","project_name":"facedet","seldon.io/grpc-read-timeout":"60000","seldon.io/rest-connection-timeout":"60000","seldon.io/rest-read-timeout":"60000"},"name":"facedet-gpu","oauth_key":"","oauth_secret":"","predictors":[{"annotations":{"predictor_version":"v1"},"componentSpecs":[{"spec":{"containers":[{"image":"xiaoyang0117/facedet-gpu:v0.1","imagePullPolicy":"IfNotPresent","name":"facedet","resources":{"limits":{"alpha.kubernetes.io/nvidia-gpu":1.0}},"volumeMounts":[{"mountPath":"/usr/local/nvidia/bin","name":"bin"},{"mountPath":"/usr/lib/nvidia","name":"lib"}]}],"imagePullSecrets":[{"name":"regcred"}],"terminationGracePeriodSeconds":1.0,"volumes":[{"hostPath":{"path":"/usr/lib/nvidia-384/bin"},"name":"bin"},{"hostPath":{"path":"/usr/lib/nvidia-384"},"name":"lib"}]}}],"graph":{"children":[],"endpoint":{"type":"GRPC"},"name":"facedet","type":"MODEL"},"name":"facedet-gpu","replicas":1.0}]}}
 2019-05-26 15:41:08.685 ERROR 1 --- [pool-3-thread-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task.
 com.google.protobuf.InvalidProtocolBufferException: Can't decode io.kubernetes.client.proto.resource.Quantity from 1.0
    at io.seldon.apife.pb.QuantityUtils$QuantityParser.merge(QuantityUtils.java:63) ~[classes!/:0.2.5]
    at io.seldon.apife.pb.JsonFormat$ParserImpl.merge(JsonFormat.java:1241) ~[classes!/:0.2.5]
    at io.seldon.apife.pb.JsonFormat$ParserImpl.parseFieldValue(JsonFormat.java:1797) ~[classes!/:0.2.5]
    at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeMapField(JsonFormat.java:1484) ~[classes!/:0.2.5]
    at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeField(JsonFormat.java:1458) ~[classes!/:0.2.5]
    at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeMessage(JsonFormat.java:1294) ~[classes!/:0.2.5]
    at io.seldon.apife.pb.JsonFormat$ParserImpl.merge(JsonFormat.java:1252) ~[classes!/:0.2.5]
    at io.seldon.apife.pb.JsonFormat$ParserImpl.parseFieldValue(JsonFormat.java:1797) ~[classes!/:0.2.5]
ukclivecox commented 5 years ago

Are you able to try this with the latest from master?

muma378 commented 5 years ago

> Are you able to try this with the latest from master?

Not yet. The latest version is really hard to install because of the Ambassador stuff, so I followed the guide in the example, which uses version 0.2.5. Do you mean, if I understood correctly, that the relevant changes were made in the latest versions?

muma378 commented 5 years ago

When I say "blocked at the parsing stage", I mean that I can see the deployment name with kubectl get sdep -n namespace but get nothing with kubectl get deploy -n namespace.

ukclivecox commented 5 years ago

If you are using 0.2.5, can you check the logs of the cluster-manager?

What problems are you having with Ambassador? In master you would install the official Ambassador Helm chart.

The issue you are having is, I think, due to the parsing of Quantity in the protobuf specs. This should be fixed in master, which is why I was hoping you could test with the latest version.
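
To make the cause concrete: the unquoted YAML value 1 is serialized back as the JSON number 1.0 (visible in the watcher output above), and the 0.2.5 JSON-to-protobuf converter cannot decode 1.0 into a Kubernetes Quantity. A sketch of a value that parses cleanly (the resource name here is illustrative):

    resources:
      limits:
        # Unquoted `1` becomes the JSON number 1.0, which triggers
        # "Can't decode ... Quantity from 1.0" on 0.2.5.
        # A quoted value is parsed as a Quantity string.
        nvidia.com/gpu: "1"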

muma378 commented 5 years ago

Yes, you are correct. I just found a similar scenario in issue #45. I checked the cluster-manager, and the error is indeed about Quantity parsing. Therefore, I changed the value to "1" like this:

    - spec:
        containers:
        - image: my-image-name:v0.1
          resources:
            limits:
              nvidia.com/gpu: "1"

However, the cluster-manager reports another error:

2019-05-27 16:19:27.156 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl             : Updating seldondeployment facedet-gpu with status state: "Failed"
description: "Can\'t find container for predictive unit with name facedet-gpu"
 2019-05-27 16:19:27.307 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher        : MODIFIED

It looks like it cannot find the image, but the image is definitely hosted. What else could make this happen?

ukclivecox commented 5 years ago

The name in the graph spec must match a container name. It looks like it can't find a container with the name facedet-gpu.
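
To illustrate, a sketch using the names from the log above: the name under graph must be the same string as the name of one of the containers in componentSpecs, otherwise the cluster-manager cannot match the predictive unit to a container.

    componentSpecs:
    - spec:
        containers:
        - image: facedet-gpu:v0.1
          name: facedet            # container name
    graph:
      children: []
      endpoint:
        type: GRPC
      name: facedet                # must match the container name above
      type: MODEL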

muma378 commented 5 years ago

Exactly! Changing the container name resolved my problem, and now I can see a Deployment being created. @cliveseldon Thanks very much for your patience! Finally, back to my original question: technically, Seldon Core is fine with containers using hardware acceleration, right?

ukclivecox commented 5 years ago

There should be no issue, as long as your model image and Pod are correctly set up. We'd love to have an example in this area, so we're happy to help you get everything working.
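
For reference, a minimal sketch of a GPU predictor spec that combines the two fixes from this thread (a quoted GPU quantity and a graph name matching the container name). The image, names, and namespace are placeholders, and it assumes the cluster exposes GPUs through the NVIDIA device plugin:

    apiVersion: machinelearning.seldon.io/v1alpha2
    kind: SeldonDeployment
    metadata:
      name: my-gpu-model
    spec:
      name: my-gpu-model
      predictors:
      - name: default
        replicas: 1
        componentSpecs:
        - spec:
            containers:
            - image: my-registry/my-gpu-model:v0.1   # placeholder image
              name: my-gpu-model
              resources:
                limits:
                  nvidia.com/gpu: "1"                # quoted so it parses as a Quantity
        graph:
          children: []
          endpoint:
            type: GRPC
          name: my-gpu-model                         # matches the container name
          type: MODEL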