I recently started diving into Kubernetes. While requesting a GPU for a pod to run some deep learning tasks, I got confused about how to set up and schedule GPUs.
I used MicroK8s to create the cluster and pods. MicroK8s makes it very easy to enable add-ons such as kubeflow, gpu, etc.
If I enable the gpu add-on in MicroK8s, do I still need to install NVIDIA's k8s-device-plugin manually?
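To be concrete, by "enable gpu" I mean the standard add-on command (this is the documented MicroK8s way to turn on GPU support, which deploys the NVIDIA GPU operator):

```
$ microk8s enable gpu
```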
This node has a single GPU. Following the instructions on the official Kubernetes website, I tried to create a test pod that requests the GPU, but I ran into the issue below.
MicroK8s version: 1.21/beta, installed with `snap install microk8s --channel=1.21/beta --classic`.
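The `gpu_test.yaml` I used was along these lines (a sketch reconstructed from the `describe` output below; the image and resource values match that output, the rest is the usual shape of the GPU example from the Kubernetes docs):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:11.2.2-devel-ubuntu18.04
      resources:
        limits:
          memory: 1G
          nvidia.com/gpu: 1   # request one GPU via the extended resource
```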
```
$ microk8s.kubectl create -f gpu_test.yaml
pod/gpu-pod created
$ microk8s.kubectl get pods
NAME                                                         READY   STATUS    RESTARTS   AGE
gpu-operator-node-feature-discovery-master-dcf999dc8-p7s64   1/1     Running   0          58m
gpu-operator-node-feature-discovery-worker-mlcpt             1/1     Running   0          58m
gpu-operator-64df558567-xx6sx                                1/1     Running   0          58m
gpu-pod                                                      0/1     Pending   0          2m21s
```
```
$ microk8s.kubectl describe pods gpu-pod
Name:         gpu-pod
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  cuda:
    Image:      nvcr.io/nvidia/cuda:11.2.2-devel-ubuntu18.04
    Port:       <none>
    Host Port:  <none>
    Limits:
      memory:          1G
      nvidia.com/gpu:  1
    Requests:
      memory:          1G
      nvidia.com/gpu:  1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lrhb7 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-lrhb7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m46s  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling  3m45s  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
```
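As I understand it, `Insufficient nvidia.com/gpu` means the node either advertises zero allocatable `nvidia.com/gpu` or the GPU is already allocated; in other words, the device plugin may never have registered the GPU with the kubelet. One way to check is to look at what the node actually advertises and whether the NVIDIA plugin pods are healthy (standard `kubectl` commands; I have not pasted my actual output here):

```
$ microk8s.kubectl get node -o jsonpath='{.items[*].status.allocatable}'
$ microk8s.kubectl get pods -A | grep -i nvidia
```

If `nvidia.com/gpu` does not appear under `allocatable`, the problem would be in the driver/device-plugin setup rather than in scheduling itself.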
I tested many kinds of yaml, but they all hit the same issue. Hence, I am wondering: does the Kubernetes GPU device plugin have trouble with a node that has only a single GPU, or is the GPU somehow not fully available for scheduling? GPU scheduling matters a lot to me because I want to run TensorRT and other deep learning workloads inside pods.
I am happy to provide more detailed information if anything is unclear.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.