Closed: muma378 closed this issue 5 years ago
Can you provide more details on the error at the "parsing" stage? Are you using the "master" branch or an earlier version of Seldon Core?
No, the versions of both seldon-core and seldon-core-crd are 0.2.5, installed locally with Helm.
The seldon-core-apiserver reports the error below:
2019-05-26 15:41:08.623 INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher : The time is now 15:41:08
2019-05-26 15:41:08.623 INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher : Watching with rs 3980232
2019-05-26 15:41:08.682 INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher : ADDED
: {"apiVersion":"machinelearning.seldon.io/v1alpha2","kind":"SeldonDeployment","metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",\"kind\":\"SeldonDeployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"seldon\",\"ksonnet.io/component\":\"facedet\"},\"name\":\"facedet-gpu\",\"namespace\":\"modelzoo\"},\"spec\":{\"annotations\":{\"deployment_version\":\"v1\",\"project_name\":\"facedet\",\"seldon.io/grpc-read-timeout\":\"60000\",\"seldon.io/rest-connection-timeout\":\"60000\",\"seldon.io/rest-read-timeout\":\"60000\"},\"name\":\"facedet-gpu\",\"oauth_key\":\"\",\"oauth_secret\":\"\",\"predictors\":[{\"annotations\":{\"predictor_version\":\"v1\"},\"componentSpecs\":[{\"spec\":{\"containers\":[{\"image\":\"facedet-gpu:v0.1\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"facedet\",\"resources\":{\"limits\":{\"alpha.kubernetes.io/nvidia-gpu\":1}},\"volumeMounts\":[{\"mountPath\":\"/usr/local/nvidia/bin\",\"name\":\"bin\"},{\"mountPath\":\"/usr/lib/nvidia\",\"name\":\"lib\"}]}],\"imagePullSecrets\":[{\"name\":\"regcred\"}],\"terminationGracePeriodSeconds\":1,\"volumes\":[{\"hostPath\":{\"path\":\"/usr/lib/nvidia-384/bin\"},\"name\":\"bin\"},{\"hostPath\":{\"path\":\"/usr/lib/nvidia-384\"},\"name\":\"lib\"}]}}],\"graph\":{\"children\":[],\"endpoint\":{\"type\":\"GRPC\"},\"name\":\"facedet\",\"type\":\"MODEL\"},\"name\":\"facedet-gpu\",\"replicas\":1}]}}\n"},"creationTimestamp":"2019-05-26T15:40:22Z","generation":1.0,"labels":{"app":"seldon","ksonnet.io/component":"facedet"},"name":"facedet-gpu","namespace":"modelzoo","resourceVersion":"4317220","selfLink":"/apis/machinelearning.seldon.io/v1alpha2/namespaces/modelzoo/seldondeployments/facedet-gpu","uid":"931f9605-7fcc-11e9-a912-408d5c260149"},"spec":{"annotations":{"deployment_version":"v1","project_name":"facedet","seldon.io/grpc-read-timeout":"60000","seldon.io/rest-connection-timeout":"60000","seldon.io/rest-read-timeout":"60000"},"name":"facedet-gpu","oauth_key":"","oauth_secret":"","predictors":[{"annotations":{"predictor_version":"v1"},"componentSpecs":[{"spec":{"containers":[{"image":"xiaoyang0117/facedet-gpu:v0.1","imagePullPolicy":"IfNotPresent","name":"facedet","resources":{"limits":{"alpha.kubernetes.io/nvidia-gpu":1.0}},"volumeMounts":[{"mountPath":"/usr/local/nvidia/bin","name":"bin"},{"mountPath":"/usr/lib/nvidia","name":"lib"}]}],"imagePullSecrets":[{"name":"regcred"}],"terminationGracePeriodSeconds":1.0,"volumes":[{"hostPath":{"path":"/usr/lib/nvidia-384/bin"},"name":"bin"},{"hostPath":{"path":"/usr/lib/nvidia-384"},"name":"lib"}]}}],"graph":{"children":[],"endpoint":{"type":"GRPC"},"name":"facedet","type":"MODEL"},"name":"facedet-gpu","replicas":1.0}]}}
2019-05-26 15:41:08.685 ERROR 1 --- [pool-3-thread-1] o.s.s.s.TaskUtils$LoggingErrorHandler : Unexpected error occurred in scheduled task.
com.google.protobuf.InvalidProtocolBufferException: Can't decode io.kubernetes.client.proto.resource.Quantity from 1.0
at io.seldon.apife.pb.QuantityUtils$QuantityParser.merge(QuantityUtils.java:63) ~[classes!/:0.2.5]
at io.seldon.apife.pb.JsonFormat$ParserImpl.merge(JsonFormat.java:1241) ~[classes!/:0.2.5]
at io.seldon.apife.pb.JsonFormat$ParserImpl.parseFieldValue(JsonFormat.java:1797) ~[classes!/:0.2.5]
at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeMapField(JsonFormat.java:1484) ~[classes!/:0.2.5]
at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeField(JsonFormat.java:1458) ~[classes!/:0.2.5]
at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeMessage(JsonFormat.java:1294) ~[classes!/:0.2.5]
at io.seldon.apife.pb.JsonFormat$ParserImpl.merge(JsonFormat.java:1252) ~[classes!/:0.2.5]
at io.seldon.apife.pb.JsonFormat$ParserImpl.parseFieldValue(JsonFormat.java:1797) ~[classes!/:0.2.5]
Are you able to try this with the latest from master?
Not yet; the latest version is really hard to install because of the Ambassador setup, so I followed the guide in the example, which uses version 0.2.5. Do you mean, if I understood correctly, that the changes were made in the latest versions?
When I say "blocked in the parsing stage", I mean I can see the deployment name with kubectl get sdep -n namespace, but get nothing with kubectl get deploy -n namespace.
If using 0.2.5, can you check the logs of the cluster-manager?
What problems are you having with Ambassador? In master you would install the official Ambassador helm chart.
The issue you are having is, I think, due to the parsing of Quantity in the protobuf specs. This should be fixed in the version in master, which is why I was hoping you could test with the latest.
Yes, you are correct, I just found a similar scenario in issue #45. I checked the cluster-manager; the error is indeed from the Quantity parsing. Therefore, I changed the value to "1" like this:
- spec:
    containers:
    - image: my-image-name:v0.1
      resources:
        limits:
          nvidia.com/gpu: "1"
However, the cluster-manager reports another error:
2019-05-27 16:19:27.156 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl : Updating seldondeployment facedet-gpu with status state: "Failed"
description: "Can\'t find container for predictive unit with name facedet-gpu"
2019-05-27 16:19:27.307 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher : MODIFIED
It looks like it is unable to find the image, but the image is definitely hosted. What else could make this happen?
The name in the graph spec must match a container name. It looks like it can't find a container with the name facedet-gpu.
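For illustration, a minimal sketch of how the two names have to line up (the container name facedet-gpu below is an assumption chosen to match the predictive unit name reported in the error; the image is the one from your manifest):

componentSpecs:
- spec:
    containers:
    - name: facedet-gpu                 # container name ...
      image: xiaoyang0117/facedet-gpu:v0.1
graph:
  name: facedet-gpu                     # ... must match the graph (predictive unit) name
  type: MODEL
  endpoint:
    type: GRPC
  children: []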
Exactly! Changing the container name resolved my problem. Now I can see a deployment being created. @cliveseldon Thanks very much for your patience! Finally, back to my original confusion: technically, seldon-core is OK with containers using hardware acceleration, right?
There should be no issue, as long as your model image and Pod are correctly set up. We'd love to have an example in this area, so happy to help you get everything working.
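For example, here is a rough sketch of a minimal GPU SeldonDeployment pulling together the points from this thread; the GPU resource key (nvidia.com/gpu here) depends on your cluster's device plugin setup, and on 0.2.5 the limit should be a quoted string so it parses as a Quantity:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: facedet-gpu
  namespace: modelzoo
spec:
  name: facedet-gpu
  predictors:
  - name: facedet-gpu
    replicas: 1
    componentSpecs:
    - spec:
        containers:
        - name: facedet                    # must match graph.name below
          image: xiaoyang0117/facedet-gpu:v0.1
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              nvidia.com/gpu: "1"          # quoted so 0.2.5 parses it as a Quantity
    graph:
      name: facedet                        # matches the container name above
      type: MODEL
      endpoint:
        type: GRPC
      children: []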
Hi, I was trying to deploy a SeldonDeployment to the cluster that asks for GPU resources and CUDA. I wrote the .yaml as the official docs suggest; however, the deployment was blocked at the "parsing" CRD stage, which resulted in no deployment or service being created. Deploying models that do not require a GPU works fine. I didn't find any example of using GPUs, so my question is: does Seldon-core support GPUs? Or has anyone succeeded in deploying a model that requires a GPU?
This is a part of my .yaml: