Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Failed to initialize NVML: ERROR_UNKNOWN #452

Open: wangzheyuan opened this issue 3 weeks ago

wangzheyuan commented 3 weeks ago

If I install HAMi without privileged=true in daemonsetnvidia.yaml, the device-plugin goes into CrashLoopBackOff. Here is the device-plugin's log:

I0821 10:02:57.139613    9897 client.go:53] BuildConfigFromFlags failed for file /root/.kube/config: stat /root/.kube/config: no such file or directory using inClusterConfig
I0821 10:02:57.150807    9897 main.go:157] Starting FS watcher.
I0821 10:02:57.150849    9897 main.go:166] Start working on node gpu-4090
I0821 10:02:57.150852    9897 main.go:167] Starting OS watcher.
I0821 10:02:57.172809    9897 main.go:182] Starting Plugins.
I0821 10:02:57.172833    9897 main.go:240] Loading configuration.
I0821 10:02:57.172943    9897 vgpucfg.go:130] flags= [--mig-strategy value  the desired strategy for exposing MIG devices on GPUs that support it:
        [none | single | mixed] (default: "none") [$MIG_STRATEGY] --fail-on-init-error  fail the plugin if an error is encountered during initialization, otherwise block indefinitely (default: true) [$FAIL_ON_INIT_ERROR] --nvidia-driver-root value the root path for the NVIDIA driver installation (typical values are '/' or '/run/nvidia/driver') (default: "/") [$NVIDIA_DRIVER_ROOT] --pass-device-specs  pass the list of DeviceSpecs to the kubelet on Allocate() (default: false) [$PASS_DEVICE_SPECS] --device-list-strategy value [ --device-list-strategy value ]   the desired strategy for passing the device list to the underlying runtime:
        [envvar | volume-mounts | cdi-annotations] (default: "envvar") [$DEVICE_LIST_STRATEGY] --device-id-strategy value   the desired strategy for passing device IDs to the underlying runtime:
        [uuid | index] (default: "uuid") [$DEVICE_ID_STRATEGY] --gds-enabled    ensure that containers are started with NVIDIA_GDS=enabled (default: false) [$GDS_ENABLED] --mofed-enabled  ensure that containers are started with NVIDIA_MOFED=enabled (default: false) [$MOFED_ENABLED] --config-file value  the path to a config file as an alternative to command line options or environment variables [$CONFIG_FILE] --cdi-annotation-prefix value   the prefix to use for CDI container annotation keys (default: "cdi.k8s.io/") [$CDI_ANNOTATION_PREFIX] --nvidia-ctk-path value   the path to use for the nvidia-ctk in the generated CDI specification (default: "/usr/bin/nvidia-ctk") [$NVIDIA_CTK_PATH] --container-driver-root value the path where the NVIDIA driver root is mounted in the container; used for generating CDI specifications (default: "/driver-root") [$CONTAINER_DRIVER_ROOT] --node-name value  node name (default: "evecom-4090") [$NodeName] --device-split-count value   the number for NVIDIA device split (default: 2) [$DEVICE_SPLIT_COUNT] --device-memory-scaling value the ratio for NVIDIA device memory scaling (default: 1) [$DEVICE_MEMORY_SCALING] --device-cores-scaling value   the ratio for NVIDIA device cores scaling (default: 1) [$DEVICE_CORES_SCALING] --disable-core-limit If set, the core utilization limit will be ignored (default: false) [$DISABLE_CORE_LIMIT] --resource-name value the name of field for number GPU visible in container (default: "nvidia.com/gpu") --help, -h    show help --version, -v print the version]
I0821 10:02:57.173052    9897 vgpucfg.go:139] DeviceMemoryScaling 1
I0821 10:02:57.173143    9897 vgpucfg.go:108] Device Plugin Configs: {[{m5-cloudinfra-online02 1.8 0 10 none}]}
I0821 10:02:57.173147    9897 main.go:255] Updating config with default resource matching patterns.
I0821 10:02:57.173269    9897 main.go:266] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "ResourceName": "nvidia.com/gpu",
  "DebugMode": null
}
I0821 10:02:57.173272    9897 main.go:269] Retrieving plugins.
I0821 10:02:57.173609    9897 factory.go:107] Detected NVML platform: found NVML library
I0821 10:02:57.173628    9897 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
config= [{* nvidia.com/gpu}]
E0821 10:02:57.199580    9897 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
E0821 10:02:57.199624    9897 factory.go:78] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0821 10:02:57.199627    9897 factory.go:79] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0821 10:02:57.199630    9897 factory.go:80] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0821 10:02:57.199632    9897 factory.go:81] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0821 10:02:57.214373    9897 main.go:126] error starting plugins: error creating plugin manager: unable to create plugin manager: nvml init failed: ERROR_UNKNOWN
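
By "privileged=true" I mean roughly the following change to the device-plugin container in daemonsetnvidia.yaml (a minimal sketch; the container name and exact placement in the chart template are assumed and may not match it exactly):

# sketch of the securityContext edit in daemonsetnvidia.yaml (container name assumed)
containers:
  - name: device-plugin
    securityContext:
      privileged: true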

If I install HAMi with privileged=true in daemonsetnvidia.yaml, the device-plugin works fine. However, containers that request a vGPU encounter the following error:

[root@gpu-4090 ~]# cat test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2
  namespace: emlp
spec:
  runtimeClassName: nvidia
  containers:
    - name: test
      image: nvidia/cuda:12.1.0-base-ubuntu18.04
      imagePullPolicy: IfNotPresent
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          nvidia.com/gpu: 1

root@gpu-pod2:/# nvidia-smi
Failed to initialize NVML: ERROR_UNKNOWN

Here is the vgpu-scheduler-extender's log:

I0822 02:42:53.118255       1 route.go:131] Start to handle webhook request on /webhook
I0822 02:42:53.118693       1 webhook.go:63] Processing admission hook for pod emlp/gpu-pod2, UID: 14ea5878-ce25-4f32-bb1e-cf6d4b42c398
I0822 02:42:53.154054       1 route.go:44] Into Predicate Route inner func
I0822 02:42:53.154209       1 scheduler.go:435] "begin schedule filter" pod="gpu-pod2" uuid="3876900e-cf59-49f0-b2f0-65fb84c8cdb9" namespaces="emlp"
I0822 02:42:53.154220       1 device.go:241] Counting mlu devices
I0822 02:42:53.154226       1 device.go:175] Counting dcu devices
I0822 02:42:53.154229       1 device.go:166] Counting iluvatar devices
I0822 02:42:53.154234       1 device.go:195] Counting ascend 910B devices
I0822 02:42:53.154238       1 ascend310p.go:209] Counting Ascend310P devices
I0822 02:42:53.154249       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":1,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}}]
I0822 02:42:53.154272       1 score.go:32] devices status
I0822 02:42:53.154285       1 score.go:34] "device status" device id="GPU-f4a6984d-1947-3b2c-03fe-40586909cbad" device detail={"Device":{"ID":"GPU-f4a6984d-1947-3b2c-03fe-40586909cbad","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":24564,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA GeForce RTX 4090","Health":true},"Score":0}
I0822 02:42:53.154294       1 score.go:34] "device status" device id="GPU-4048ae23-1753-4d20-96d0-16be28f65017" device detail={"Device":{"ID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":24564,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA GeForce RTX 4090","Health":true},"Score":0}
I0822 02:42:53.154301       1 node_policy.go:61] node gpu-4090 used 0, usedCore 0, usedMem 0,
I0822 02:42:53.154306       1 node_policy.go:73] node gpu-4090 computer score is 0.000000
I0822 02:42:53.154314       1 gpu_policy.go:70] device GPU-f4a6984d-1947-3b2c-03fe-40586909cbad user 0, userCore 0, userMem 0,
I0822 02:42:53.154317       1 gpu_policy.go:76] device GPU-f4a6984d-1947-3b2c-03fe-40586909cbad computer score is 11.000000
I0822 02:42:53.154319       1 gpu_policy.go:70] device GPU-4048ae23-1753-4d20-96d0-16be28f65017 user 0, userCore 0, userMem 0,
I0822 02:42:53.154321       1 gpu_policy.go:76] device GPU-4048ae23-1753-4d20-96d0-16be28f65017 computer score is 11.000000
I0822 02:42:53.154329       1 score.go:68] "Allocating device for container request" pod="emlp/gpu-pod2" card request={"Nums":1,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}
I0822 02:42:53.154345       1 score.go:72] "scoring pod" pod="emlp/gpu-pod2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=1 device="GPU-4048ae23-1753-4d20-96d0-16be28f65017"
I0822 02:42:53.154352       1 score.go:60] checkUUID result is true for NVIDIA type
I0822 02:42:53.154358       1 score.go:124] "first fitted" pod="emlp/gpu-pod2" device="GPU-4048ae23-1753-4d20-96d0-16be28f65017"
I0822 02:42:53.154366       1 score.go:135] "device allocate success" pod="emlp/gpu-pod2" allocate device={"NVIDIA":[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]}
I0822 02:42:53.154370       1 scheduler.go:470] nodeScores_len= 1
I0822 02:42:53.154373       1 scheduler.go:473] schedule emlp/gpu-pod2 to evegpucom-4090 map[NVIDIA:[[{0 GPU-4048ae23-1753-4d20-96d0-16be28f65017 NVIDIA 24564 0}]]]
I0822 02:42:53.154388       1 util.go:146] Encoded container Devices: GPU-4048ae23-1753-4d20-96d0-16be28f65017,NVIDIA,24564,0:
I0822 02:42:53.154390       1 util.go:169] Encoded pod single devices GPU-4048ae23-1753-4d20-96d0-16be28f65017,NVIDIA,24564,0:;
I0822 02:42:53.154395       1 pods.go:63] Pod added: Name: gpu-pod2, UID: 3876900e-cf59-49f0-b2f0-65fb84c8cdb9, Namespace: emlp, NodeID: gpu-4090
I0822 02:42:53.162102       1 scheduler.go:368] "Bind" pod="gpu-pod2" namespace="emlp" podUID="3876900e-cf59-49f0-b2f0-65fb84c8cdb9" node="gpu-4090"
I0822 02:42:53.162380       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.169721       1 device.go:241] Counting mlu devices
I0822 02:42:53.193546       1 nodelock.go:62] "Node lock set" node="gpu-4090"
I0822 02:42:53.203870       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.207552       1 scheduler.go:421] After Binding Process
I0822 02:42:53.208761       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.254593       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.267133       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.312288       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.691029       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:54.525899       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:57.451349       1 scheduler.go:195] "New timestamp" hami.io/node-handshake="Requesting_2024.08.22 02:42:57" nodeName="gpu-4090"
I0822 02:42:57.473008       1 util.go:137] Encoded node Devices: GPU-f4a6984d-1947-3b2c-03fe-40586909cbad,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true:GPU-4048ae23-1753-4d20-96d0-16be28f65017,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true:
I0822 02:43:27.568147       1 scheduler.go:195] "New timestamp" hami.io/node-handshake="Requesting_2024.08.22 02:43:27" nodeName="gpu-4090"
I0822 02:43:27.597171       1 util.go:137] Encoded node Devices: GPU-f4a6984d-1947-3b2c-03fe-40586909cbad,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true:GPU-4048ae23-1753-4d20-96d0-16be28f65017,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true:

Environment:
Ubuntu: 22.04.4
Kubernetes: RKE2 1.28.12
Containerd: v1.7.17-k3s1
NVIDIA Container Toolkit: 1.15.0

root@gpu-4090:~# nvidia-smi
Tue Aug 20 16:49:31 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   39C    P8             34W /  450W |      20MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:04:00.0 Off |                  Off |
|  0%   34C    P8             23W /  450W |      20MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                              9MiB |
|    0   N/A  N/A      2153      G   /usr/bin/gnome-shell                           10MiB |
|    1   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

root@gpu-4090:~# cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true

root@gpu-4090:~# cat hami/values.yaml
scheduler:
  kubeScheduler:
    image: registry.k8s.io/kube-scheduler
    imageTag: v1.28.12
  nodeSelector:
    kubernetes.io/hostname: gpu-4090
devicePlugin:
  runtimeClassName: nvidia
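
For completeness, runtimeClassName: nvidia resolves through a RuntimeClass object on the cluster; mine looks roughly like this minimal sketch (the handler must match the "nvidia" runtime name in the containerd config.toml above):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the containerd runtime key configured above
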
archlitchi commented 3 weeks ago

It seems your NVIDIA driver may not be installed correctly. You can try installing nvidia-device-plugin v0.14 and see if that launches correctly.
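
A minimal sketch of helm values for that test (value name assumed from the nvidia-device-plugin chart, not verified against the v0.14 chart):

# hypothetical values.yaml for the nvidia-device-plugin helm chart (key name assumed)
runtimeClassName: nvidia   # run the plugin pod under the nvidia containerd runtime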

wangzheyuan commented 3 weeks ago

The NVIDIA GPU Operator works fine, but nvidia-device-plugin v0.14.5 fails with the same error:

I0822 08:51:42.921468       1 main.go:154] Starting FS watcher.
I0822 08:51:42.921503       1 main.go:161] Starting OS watcher.
I0822 08:51:42.921566       1 main.go:176] Starting Plugins.
I0822 08:51:42.921574       1 main.go:234] Loading configuration.
I0822 08:51:42.921623       1 main.go:242] Updating config with default resource matching patterns.
I0822 08:51:42.921704       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0822 08:51:42.921708       1 main.go:256] Retreiving plugins.
I0822 08:51:42.921955       1 factory.go:107] Detected NVML platform: found NVML library
I0822 08:51:42.921968       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0822 08:51:42.925620       1 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
E0822 08:51:42.925629       1 factory.go:78] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0822 08:51:42.925630       1 factory.go:79] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0822 08:51:42.925632       1 factory.go:80] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0822 08:51:42.925634       1 factory.go:81] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0822 08:51:42.925723       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: nvml init failed: ERROR_UNKNOWN
lengrongfu commented 3 weeks ago

You can look at the toolkit pod's log.

wangzheyuan commented 1 week ago

> You can look at the toolkit pod's log.

You mean the NVIDIA Container Toolkit?