Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Container with privileged security context failed to be scheduled #611

Open gongysh2004 opened 2 days ago

gongysh2004 commented 2 days ago

What happened: A container with a privileged security context failed to be scheduled.
What you expected to happen: It should be scheduled.
How to reproduce it (as minimally and precisely as possible): Install HAMi following the installation steps, then apply the following deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  labels:
    app: gpu-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
      - name: gpu-test
        securityContext:
          privileged: true
        image: ubuntu:18.04
        resources:
          limits:
            nvidia.com/gpu: 2 # requesting 2 vGPUs
            nvidia.com/gpumem: 10240
        command: ["/bin/sh", "-c"]
        args: ["while true; do cat /mnt/data/test.txt; sleep 5; done"]
        volumeMounts:
        - mountPath: "/mnt/data"
          name: data-volume
      volumes:
      - name: data-volume
        hostPath:
          path: /opt/data
          type: Directory

Anything else we need to know?:

- The kubelet logs on the node (e.g., `sudo journalctl -r -u kubelet`)
- Any relevant kernel output lines from `dmesg`

**Environment**:
- HAMi version:

root@node7vm-1:~/test# helm ls -A | grep hami
hami           kube-system  2  2024-11-14 15:18:36.886955318 +0800 CST  deployed  hami-2.4.0        2.4.0
my-hami-webui  kube-system  4  2024-11-14 17:18:24.678439025 +0800 CST  deployed  hami-webui-1.0.3  1.0.3

- nvidia driver or other AI device driver version:

root@node7bm-1:~# nvidia-smi
Thu Nov 14 15:58:33 2024
NVIDIA-SMI 535.161.08    Driver Version: 535.161.08    CUDA Version: 12.2

GPU  Name         Persistence-M  Bus-Id            Disp.A  ECC  Fan  Temp  Perf  Pwr:Usage/Cap  Memory-Usage     GPU-Util  Compute M.  MIG M.
0    NVIDIA L40S  On             00000000:08:00.0  Off     Off  N/A  27C   P8    22W / 350W     0MiB / 49140MiB  0%        Default     N/A
1    NVIDIA L40S  On             00000000:09:00.0  Off     Off  N/A  28C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
2    NVIDIA L40S  On             00000000:0E:00.0  Off     Off  N/A  26C   P8    19W / 350W     0MiB / 49140MiB  0%        Default     N/A
3    NVIDIA L40S  On             00000000:11:00.0  Off     Off  N/A  26C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
4    NVIDIA L40S  On             00000000:87:00.0  Off     Off  N/A  26C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
5    NVIDIA L40S  On             00000000:8D:00.0  Off     Off  N/A  26C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
6    NVIDIA L40S  On             00000000:90:00.0  Off     Off  N/A  26C   P8    21W / 350W     0MiB / 49140MiB  0%        Default     N/A
7    NVIDIA L40S  On             00000000:91:00.0  Off     Off  N/A  27C   P8    19W / 350W     0MiB / 49140MiB  0%        Default     N/A

- Docker version from `docker version`
- Docker command, image and tag used
- Kernel version from `uname -a`

Linux node7vm-1 5.15.0-125-generic #135-Ubuntu SMP Fri Sep 27 13:53:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

- Others:

If I don't request `nvidia.com/gpumem`:

Events:
  Type     Reason                    Age  From               Message
  ----     ------                    ---  ----               -------
  Normal   Scheduled                 19s  default-scheduler  Successfully assigned default/gpu-test-5f9f7d48d9-4wsrp to node7bm-1
  Warning  UnexpectedAdmissionError  20s  kubelet            Allocate failed due to rpc error: code = Unknown desc = no binding pod found on node node7bm-1, which is unexpected

If I request `nvidia.com/gpumem`:

Events:
  Type     Reason            Age  From               Message
  ----     ------            ---  ----               -------
  Warning  FailedScheduling  14s  default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 Insufficient nvidia.com/gpumem. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod..

Nimbus318 commented 2 days ago

Privileged Pods have direct access to the host's devices: they share the host's device namespace and can access everything under the /dev directory. This effectively bypasses the container's device isolation.
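
For example, a minimal Pod sketch (the pod name, image, and command here are just illustrative) shows the consequence: inside a privileged container every host device node under /dev, including all /dev/nvidia* devices, is visible no matter what was requested in resources.limits, so per-container vGPU and GPU-memory limits can't be enforced there.

apiVersion: v1
kind: Pod
metadata:
  name: privileged-dev-check # illustrative name
spec:
  containers:
  - name: dev-check
    image: ubuntu:18.04
    securityContext:
      privileged: true # shares the host's device namespace
    command: ["/bin/sh", "-c"]
    # Lists every host GPU device node, regardless of any nvidia.com/* limits
    args: ["ls -l /dev/nvidia*; sleep 3600"]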

So, in our HAMi webhook:

    // Skip (i.e. don't mutate) any container that runs privileged.
    if c.SecurityContext != nil && c.SecurityContext.Privileged != nil && *c.SecurityContext.Privileged {
        klog.Warningf(template+" - Denying admission as container %s is privileged", req.Namespace, req.Name, req.UID, c.Name)
        continue
    }

The code skips handling privileged Pods altogether, which means they fall back to being scheduled by the default scheduler. You can see from the Events you posted that the Pod was scheduled by the default-scheduler.

So, the reason scheduling fails when resources.limits includes nvidia.com/gpumem is that the default-scheduler doesn't recognize the nvidia.com/gpumem extended resource.
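
If the workload doesn't strictly need privileged mode, one possible workaround (a sketch based on the deployment in the reproduction above; volumes and the test command are omitted, adjust as needed) is to drop `privileged: true`, so the HAMi webhook no longer skips the container and nvidia.com/gpumem is handled by the HAMi scheduler:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
      - name: gpu-test
        # No privileged securityContext, so the webhook admits the container
        # and nvidia.com/gpumem can be scheduled by the HAMi scheduler.
        image: ubuntu:18.04
        resources:
          limits:
            nvidia.com/gpu: 2 # requesting 2 vGPUs
            nvidia.com/gpumem: 10240
        command: ["/bin/sh", "-c"]
        args: ["sleep infinity"]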