AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Wrong GPU ID #191

Closed tintranvan closed 1 year ago

tintranvan commented 1 year ago

I just updated the GPU cards in my servers, as shown below. I now have 3 servers with different GPU cards.

image

When I try to create a new pod, it is assigned to Server 2 with GPU ID = 1, although GPU 1 (0/0 Allocated) does not exist on Server 2. My deployment therefore fails with this error:

 Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: device error: no-gpu-has-9MiB-to-run: unknown device: unknown

Why wasn't the pod assigned to Server 2 with GPU ID = 0?

How can I resolve this issue?

Thanks so much

swartz-k commented 1 year ago

Hello @tintranvan, can you check which version you are using?

antikilahdjs commented 1 year ago

I guess it is a logic issue. In our case we needed to change the logic so that GPUs are distributed correctly, but that change is still under development.

captainsk7 commented 1 year ago

@antikilahdjs Have you resolved this issue? I'm facing the same issue...

tintranvan commented 1 year ago

Hello,

I resolved this problem by restarting the GPU Sharing DaemonSet so that it picks up the new GPU card information. Whenever the GPU hardware changes, we should restart the DaemonSet so that this information is refreshed.

Best Regards
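
For anyone landing here later, the restart described above could be scripted roughly as follows. This is only a sketch: the namespace `kube-system` and the DaemonSet name `gpushare-device-plugin-ds` are assumptions based on a default install and are not confirmed anywhere in this thread, so check them with `kubectl get daemonset -A` first. The helper function names are made up for illustration. The script only prints the commands unless `dry_run` is set to `False`.

```python
import shlex
import subprocess

# Assumed values, not confirmed in this thread: adjust to match your cluster.
NAMESPACE = "kube-system"
DAEMONSET = "gpushare-device-plugin-ds"


def restart_command(namespace=NAMESPACE, daemonset=DAEMONSET):
    """Build the kubectl command that triggers a rolling restart of a DaemonSet."""
    return ["kubectl", "-n", namespace, "rollout", "restart", f"daemonset/{daemonset}"]


def status_command(namespace=NAMESPACE, daemonset=DAEMONSET):
    """Build the kubectl command that waits for the restarted pods to become ready."""
    return ["kubectl", "-n", namespace, "rollout", "status", f"daemonset/{daemonset}"]


def main(dry_run=True):
    # Print each command; only execute it when dry_run is False.
    for cmd in (restart_command(), status_command()):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main(dry_run=True)
```

After the DaemonSet pods come back, pods that were already stuck on the non-existent GPU ID will probably need to be deleted and recreated so that they are scheduled again.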