AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

gpushare-device plugin daemonset is not working #153

Closed riyasoni5990 closed 3 years ago

riyasoni5990 commented 3 years ago

Hi there, I really appreciate your solution for GPU sharing in Kubernetes. I followed all the steps to set up this scheduler extender, but I am having an issue with the gpushare-device-plugin DaemonSet: it is not working, its desired pod count is zero, I am unable to describe the DaemonSet, and no pods are created by it.

Screenshot from 2021-05-24 13-42-09

wsxiaozhang commented 3 years ago

Hi @riyasoni5990, did you check your GPU node label and the RBAC config for gpushare-device-plugin?
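
A quick sketch of those two checks, assuming the stock manifests from this repo (node label `gpushare=true`; DaemonSet `gpushare-device-plugin-ds` and ServiceAccount `gpushare-device-plugin` in kube-system). `testserver02` is a placeholder for the GPU node's name:

```shell
# Placeholder: replace with your GPU node's actual name
NODE=testserver02

# The device plugin DaemonSet only schedules onto nodes labeled
# gpushare=true; a missing label is a common cause of desired count 0.
kubectl label node "$NODE" gpushare=true --overwrite
kubectl get node "$NODE" --show-labels

# RBAC objects installed by device-plugin-rbac.yaml
kubectl -n kube-system get serviceaccount gpushare-device-plugin
kubectl -n kube-system describe daemonset gpushare-device-plugin-ds
```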

riyasoni5990 commented 3 years ago

Screenshot from 2021-05-26 13-20-10

riyasoni5990 commented 3 years ago

Is it necessary to use a Tesla K80 GPU, or can I use a Tesla P100 too?

wsxiaozhang commented 3 years ago

> Is it necessary to use a Tesla K80 GPU, or can I use a Tesla P100 too?

Yes, it can work on a P100.

wsxiaozhang commented 3 years ago

Are you sure your k8s cluster has just one master node, and that it has a GPU device on it?

riyasoni5990 commented 3 years ago

Screenshot from 2021-05-30 16-22-40
Screenshot from 2021-05-30 16-22-56
Screenshot from 2021-05-30 16-23-48

Everything is working fine now (device plugin, scheduler extender), but when I run the `kubectl inspect gpushare` command it shows: Allocated/Total GPU Memory In Cluster: 0/0 (0%)

I am using:

  * GPU: Tesla P100
  * Kubernetes version: v1.21.1
  * Driver version: 465.19.01

Please help me to resolve this.

wsxiaozhang commented 3 years ago

So far, according to your latest screenshots, the device plugin has run but cannot discover any GPU device on testserver02. So:

  1. Firstly, please make sure your GPU node works normally. You can run nvidia-smi on the node to verify.
  2. Could you double-check that your GPU node environment meets the prerequisites of the NVIDIA Docker runtime? nvidia-docker version > 2.0 (see how to install it and its prerequisites), and Docker configured with nvidia as the default runtime.
  3. If all of that environment is fine, have you ever successfully run a GPU pod using the K8s default scheduler and NVIDIA's device plugin, which allocates a GPU device to a pod exclusively?

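For step 2, a minimal sketch of what the node's /etc/docker/daemon.json should look like with nvidia as the default runtime (per the nvidia-docker2 docs). It is written to a temp path here purely for illustration; on a real node you would edit /etc/docker/daemon.json and restart Docker:

```shell
# Illustrative only: on a real node this content goes in
# /etc/docker/daemon.json, followed by restarting the docker daemon.
cat > /tmp/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# Sanity check: nvidia must be registered as the default runtime,
# otherwise the device plugin's container cannot see the GPU.
grep '"default-runtime": "nvidia"' /tmp/daemon.json
```

Without `"default-runtime": "nvidia"`, pods that do not explicitly request the nvidia runtime (including the device plugin itself) run with runc and see no GPU, which matches the 0/0 memory symptom above.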
riyasoni5990 commented 3 years ago

Thank You for your support, everything is working fine now.

wsxiaozhang commented 3 years ago

> Thank You for your support, everything is working fine now.

Good to know. Gonna close the issue.