Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Container with privileged security context failed to be scheduled #611

Open gongysh2004 opened 2 days ago

gongysh2004 commented 2 days ago

What happened: A container with a privileged security context failed to be scheduled.
What you expected to happen: It should be scheduled.
How to reproduce it (as minimally and precisely as possible): Install HAMi following the installation steps, then apply the following deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  labels:
    app: gpu-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
      - name: gpu-test
        securityContext:
          privileged: true
        image: ubuntu:18.04
        resources:
          limits:
            nvidia.com/gpu: 2 # requesting 2 vGPUs
            nvidia.com/gpumem: 10240
        command: ["/bin/sh", "-c"]
        args: ["while true; do cat /mnt/data/test.txt; sleep 5; done"]
        volumeMounts:
        - mountPath: "/mnt/data"
          name: data-volume
      volumes:
      - name: data-volume
        hostPath:
          path: /opt/data
          type: Directory

Anything else we need to know?:

- The kubelet logs on the node (e.g., `sudo journalctl -r -u kubelet`)
- Any relevant kernel output lines from `dmesg`

**Environment**:
- HAMi version:

root@node7vm-1:~/test# helm ls -A | grep hami
hami           kube-system  2  2024-11-14 15:18:36.886955318 +0800 CST  deployed  hami-2.4.0        2.4.0
my-hami-webui  kube-system  4  2024-11-14 17:18:24.678439025 +0800 CST  deployed  hami-webui-1.0.3  1.0.3

- nvidia driver or other AI device driver version:

root@node7bm-1:~# nvidia-smi
Thu Nov 14 15:58:33 2024
NVIDIA-SMI 535.161.08    Driver Version: 535.161.08    CUDA Version: 12.2

GPU  Name         Persistence-M  Bus-Id            Disp.A  ECC  Fan  Temp  Perf  Pwr:Usage/Cap  Memory-Usage     GPU-Util  Compute M.  MIG M.
0    NVIDIA L40S  On             00000000:08:00.0  Off     Off  N/A  27C   P8    22W / 350W     0MiB / 49140MiB  0%        Default     N/A
1    NVIDIA L40S  On             00000000:09:00.0  Off     Off  N/A  28C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
2    NVIDIA L40S  On             00000000:0E:00.0  Off     Off  N/A  26C   P8    19W / 350W     0MiB / 49140MiB  0%        Default     N/A
3    NVIDIA L40S  On             00000000:11:00.0  Off     Off  N/A  26C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
4    NVIDIA L40S  On             00000000:87:00.0  Off     Off  N/A  26C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
5    NVIDIA L40S  On             00000000:8D:00.0  Off     Off  N/A  26C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
6    NVIDIA L40S  On             00000000:90:00.0  Off     Off  N/A  26C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
7    NVIDIA L40S  On             00000000:91:00.0  Off     Off  N/A  27C   P8    19W / 350W     0MiB / 49140MiB  0%        Default     N/A

- Docker version from `docker version`
- Docker command, image and tag used
- Kernel version from `uname -a`

Linux node7vm-1 5.15.0-125-generic #135-Ubuntu SMP Fri Sep 27 13:53:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

- Others:

If I don't request `nvidia.com/gpumem`:

Events:
  Type     Reason                    Age  From               Message
  ----     ------                    ---  ----               -------
  Normal   Scheduled                 19s  default-scheduler  Successfully assigned default/gpu-test-5f9f7d48d9-4wsrp to node7bm-1
  Warning  UnexpectedAdmissionError  20s  kubelet            Allocate failed due to rpc error: code = Unknown desc = no binding pod found on node node7bm-1, which is unexpected

If I request `nvidia.com/gpumem`:

Events:
  Type     Reason            Age  From               Message
  ----     ------            ---  ----               -------
  Warning  FailedScheduling  14s  default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 Insufficient nvidia.com/gpumem. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod..

Nimbus318 commented 2 days ago

Privileged Pods have direct access to the host's devices: they share the host's device namespace and can access everything under the /dev directory. This effectively bypasses the container's device isolation.
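
For example, a minimal Pod sketch (the pod name, image, and command here are just illustrative) shows the consequence: inside a privileged container every host device node under /dev, including all /dev/nvidia* devices, is visible no matter what was requested in resources.limits, so per-container vGPU and GPU-memory limits can't be enforced there.

apiVersion: v1
kind: Pod
metadata:
  name: privileged-dev-check # illustrative name
spec:
  containers:
  - name: dev-check
    image: ubuntu:18.04
    securityContext:
      privileged: true # shares the host's device namespace
    command: ["/bin/sh", "-c"]
    # Lists every host GPU device node, regardless of any nvidia.com/* limits
    args: ["ls -l /dev/nvidia*; sleep 3600"]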

So, in our HAMi webhook:

    // Skip (i.e. don't mutate) any container that runs privileged.
    if c.SecurityContext != nil && c.SecurityContext.Privileged != nil && *c.SecurityContext.Privileged {
        klog.Warningf(template+" - Denying admission as container %s is privileged", req.Namespace, req.Name, req.UID, c.Name)
        continue
    }

The code skips handling privileged Pods altogether, which means they fall back to being scheduled by the default scheduler. You can see from the Events you posted that the Pod was scheduled by the default-scheduler.

So, the reason scheduling fails when resources.limits includes nvidia.com/gpumem is that the default-scheduler doesn't recognize the nvidia.com/gpumem extended resource.
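
If the workload doesn't strictly need privileged mode, one possible workaround (a sketch based on the deployment in the reproduction above; volumes and the test command are omitted, adjust as needed) is to drop `privileged: true`, so the HAMi webhook no longer skips the container and nvidia.com/gpumem is handled by the HAMi scheduler:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
      - name: gpu-test
        # No privileged securityContext, so the webhook admits the container
        # and nvidia.com/gpumem can be scheduled by the HAMi scheduler.
        image: ubuntu:18.04
        resources:
          limits:
            nvidia.com/gpu: 2 # requesting 2 vGPUs
            nvidia.com/gpumem: 10240
        command: ["/bin/sh", "-c"]
        args: ["sleep infinity"]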