NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

The MPS container has started running, but cannot call GPU resources inside the container #805

Open xiaoxiaoboyyds opened 4 months ago

xiaoxiaoboyyds commented 4 months ago


1. Quick Debug Information

2. Issue or feature description

After successfully deploying the NVIDIA device plugin with the commands below, I was also able to make the GPU schedulable with 10 replicas. However, when running in MPS mode inside the container, YOLO is unable to use the GPU when it requests resources, whereas with timeSlicing it works. Do I need to enable something else? I have already started nvidia-cuda-mps-control -d, and running nvidia-smi inside the container also shows the GPU.
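
A few checks inside the workload container can help confirm whether the CUDA client is actually being pointed at the plugin-managed MPS daemon (a sketch; the exact pipe directory and mount layout depend on the plugin version and configuration):

env | grep -i CUDA_MPS    # CUDA_MPS_PIPE_DIRECTORY should point at the plugin-managed pipe directory
ls -l /dev/shm            # should be the shared-memory mount provided for MPS, not the host's /dev/shm
nvidia-smi                # compute mode should read E. Process (Exclusive Process) while MPS is active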

3. Information to attach (optional if deemed irrelevant)

root@VM-16-14-ubuntu:/dev/shm# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:00:08.0 Off |                    0 |
| N/A   35C    P0              23W / 300W |     32MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     93272      C   nvidia-cuda-mps-server                       30MiB |
+---------------------------------------------------------------------------------------+
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
cat << EOF > /tmp/dp-config.yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
EOF
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set tolerations[0].key=node-role.kubernetes.io/edge \
  --set tolerations[0].operator=Exists \
  --set tolerations[0].effect=NoSchedule \
  --set tolerations[1].key=nvidia.com/gpu \
  --set tolerations[1].operator=Exists \
  --set tolerations[1].effect=NoSchedule \
  --set-file config.map.config=/tmp/dp-config.yaml
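
For reference, the sharing block also supports renaming the replicated resource so that shared replicas are easier to tell apart from exclusive GPUs; a sketch, assuming the renameByDefault option documented for time-slicing also applies to mps:

version: v1
sharing:
  mps:
    renameByDefault: true   # advertise replicas as nvidia.com/gpu.shared instead of nvidia.com/gpu (assumption noted above)
    resources:
    - name: nvidia.com/gpu
      replicas: 10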


root@master01:/home/ubuntu# kubectl get pods -A
NAMESPACE              NAME                                            READY   STATUS             RESTARTS        AGE
nvidia-device-plugin   nvidia-device-plugin-5cdkb                      2/2     Running            0               69m
nvidia-device-plugin   nvidia-device-plugin-mps-control-daemon-zr4hj   2/2     Running            0               69m
.......
.....
root@master01:/home/ubuntu#  kubectl describe nodes edgenode-test
Name:               edgenode-test
Roles:              agent,edge
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-model.vendor_id=NVIDIA
                    feature.node.kubernetes.io/pci-10de.present=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=edgenode-test
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/agent=
                    node-role.kubernetes.io/edge=
                    nos.nebuly.com/gpu-partitioning=mps
                    nvidia.com/gpu.present=true
                    nvidia.com/mps.capable=true
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 05 Jul 2024 18:09:42 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  edgenode-test
  AcquireTime:     <unset>
  RenewTime:       Mon, 08 Jul 2024 17:56:17 +0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 08 Jul 2024 17:54:25 +0800   Mon, 08 Jul 2024 15:06:35 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 08 Jul 2024 17:54:25 +0800   Mon, 08 Jul 2024 15:06:35 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 08 Jul 2024 17:54:25 +0800   Mon, 08 Jul 2024 15:06:35 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 08 Jul 2024 17:54:25 +0800   Mon, 08 Jul 2024 15:06:35 +0800   EdgeReady                    edge is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  119.45.165.216
  Hostname:    edgenode-test
Capacity:
  cpu:                    10
  ephemeral-storage:      489087124Ki
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 40284092Ki
  nvidia.com/gpu:         10
  nvidia.com/gpu.shared:  0
  pods:                   110
Allocatable:
  cpu:                    10
  ephemeral-storage:      450742692733
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 40181692Ki
  nvidia.com/gpu:         10
  nvidia.com/gpu.shared:  0
  pods:                   110
System Info:
  Machine ID:                 36721b810c324ab782c87a701e59cb09
  System UUID:                36721b81-0c32-4ab7-82c8-7a701e59cb09
  Boot ID:                    86d43f47-cd9f-451a-9d44-fcb9ceb70d73
  Kernel Version:             5.15.0-113-generic
  OS Image:                   Ubuntu 22.04 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.18
  Kubelet Version:            v1.28.6-kubeedge-v1.17.0
  Kube-Proxy Version:         v0.0.0-master+$Format:%H$
PodCIDR:                      192.168.21.0/24
PodCIDRs:                     192.168.21.0/24
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                             ------------  ----------  ---------------  -------------  ---
  edge                        video-57f9659fb9-gpttn                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         19m
  edge                        video-b58c6685c-gzg2g                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         19m
  kubeedge                    cloud-iptables-manager-k9v9c                     100m (1%)     200m (2%)   25Mi (0%)        50Mi (0%)      2d23h
  kubeedge                    edge-eclipse-mosquitto-6c2mm                     100m (1%)     200m (2%)   50Mi (0%)        100Mi (0%)     2d23h
  nvidia-device-plugin        nvidia-device-plugin-5cdkb                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         70m
  nvidia-device-plugin        nvidia-device-plugin-mps-control-daemon-zr4hj    0 (0%)        0 (0%)      0 (0%)           0 (0%)         70m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource               Requests   Limits
  --------               --------   ------
  cpu                    200m (2%)  400m (4%)
  memory                 75Mi (0%)  150Mi (0%)
  ephemeral-storage      0 (0%)     0 (0%)
  hugepages-1Gi          0 (0%)     0 (0%)
  hugepages-2Mi          0 (0%)     0 (0%)
  nvidia.com/gpu         2          2
  nvidia.com/gpu.shared  0          0
Events:                  <none>
xiaoxiaoboyyds commented 4 months ago

My Deployment:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations: {}
  labels:
    k8s.kuboard.cn/layer: svc
    k8s.kuboard.cn/name: video
  name: video
  namespace: edge
  resourceVersion: '1597692'
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      k8s.kuboard.cn/layer: svc
      k8s.kuboard.cn/name: video
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: '2024-07-08T17:37:11+08:00'
      creationTimestamp: null
      labels:
        k8s.kuboard.cn/layer: svc
        k8s.kuboard.cn/name: video
    spec:
      containers:
        - image: 'harbor.moolink.net/moolink/video-supervision:v1.2'
          imagePullPolicy: IfNotPresent
          name: video
          resources:
            limits:
              nvidia.com/gpu: '1'
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /dev/shm
              name: shm
      dnsPolicy: ClusterFirst
      nodeName: edgenode-test
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - hostPath:
            path: /dev/shm
            type: Directory
          name: shm
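
To see which MPS-related environment the plugin actually injected into the running container, a check along these lines can be used (a sketch; deployment name and namespace are taken from the manifest above):

kubectl -n edge exec deploy/video -- env | grep -i -e cuda -e mps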
topikachu commented 3 months ago

I met a similar issue and found a comment at https://github.com/NVIDIA/k8s-device-plugin/issues/467#issuecomment-1974252052. Remove the /dev/shm mount from your deployment and try again. Could you report back whether this works or not?

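A minimal sketch of the relevant part of the pod template with that mount removed (image, names, and node are taken from the original manifest; everything else stays the same):

    spec:
      containers:
        - image: 'harbor.moolink.net/moolink/video-supervision:v1.2'
          imagePullPolicy: IfNotPresent
          name: video
          resources:
            limits:
              nvidia.com/gpu: '1'
          # no volumeMounts entry for /dev/shm
      # no hostPath volume for /dev/shm either; the MPS control daemon is expected
      # to provide the shared-memory mount for its clients
      nodeName: edgenode-test
      restartPolicy: Always
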
chipzoller commented 3 months ago

~Running with MPS requires you to set hostPID: true in the Pod spec, which I don't see you've done. I suspect this would resolve the issue.~ Correction: this is only required on GKE.
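
For anyone on GKE where that field is required, it sits at the pod spec level; a sketch using the image from the original manifest:

  template:
    spec:
      hostPID: true   # per the correction above, only required on GKE
      containers:
        - name: video
          image: 'harbor.moolink.net/moolink/video-supervision:v1.2'
          resources:
            limits:
              nvidia.com/gpu: '1'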

jaffe-fly commented 3 months ago

Why is there no nvidia-device-plugin-mps-control-daemon pod after running the following?

helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set-file config.map.config=config.yaml

kubectl get pod -n nvidia-device-plugin shows only:

NAME                              READY   STATUS    RESTARTS   AGE
nvdp-nvidia-device-plugin-fmhlh   2/2     Running   0          21m

and the node reports nvidia.com/gpu: 0:

Capacity:
  cpu:                16
  ephemeral-storage:  309506092Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131500512Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                16
  ephemeral-storage:  285240813915
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131193312Ki
  nvidia.com/gpu:     0
  pods:               110

I am using nvidia-device-plugin 0.16.1.
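
Two things worth checking here (a sketch; the daemonset name assumes the nvdp release shown above): whether the plugin actually picked up the sharing config, and whether the node carries the MPS capability label that the control daemon is typically scheduled on:

kubectl -n nvidia-device-plugin logs ds/nvdp-nvidia-device-plugin --all-containers | grep -i -e mps -e sharing
kubectl get nodes -L nvidia.com/mps.capable,nvidia.com/gpu.present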

ettelr commented 3 months ago

Hi, is there any update on making the MPS shm size configurable in recent releases? We cannot use it as it is, since each of our workloads needs a different shm size.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.