Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

nvidia.com/gpu: 0 #314

Open mama0512 opened 1 month ago

mama0512 commented 1 month ago
The node reports `nvidia.com/gpu: 0` in both `Capacity` and `Allocatable`, even though four GPUs are visible to the driver.

1. `kubectl describe node 416a100`:

```
Name:               416a100
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    gpu=on
                    k3s.io/hostname=416a100
                    k3s.io/internal-ip=192.168.2.145
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=416a100
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"a2:0a:a5:6d:d7:7e"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.2.145
                    hami.io/mutex.lock: 2024-05-13T13:04:17Z
                    hami.io/node-handshake: Requesting_2024.05.20 11:44:46
                    hami.io/node-nvidia-register:
                      GPU-b7c4eb59-dd76-ca5a-8482-56fd796b0a75,10,40960,100,NVIDIA-NVIDIA A100-PCIE-40GB,0,true:GPU-ec7d894f-bb24-dc73-1adb-17806ec68749,10,4096...
                    k3s.io/node-args: ["server","--docker"]
                    k3s.io/node-config-hash: 6DWNFWQMIJPJNOOSYKGNXGNN7DPG53Z77PAPIQ56XNVT2UPS3TFA====
                    k3s.io/node-env: {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/cba07c8500bccabd42d9215a6af6b01181cb6ca5755d12ae1e4e02b27b50bafa"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 09 May 2024 15:48:32 +0800
Taints:
Unschedulable:      false
Lease:
  HolderIdentity:  416a100
  AcquireTime:
  RenewTime:       Tue, 21 May 2024 11:41:41 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 21 May 2024 00:17:41 +0800   Tue, 21 May 2024 00:17:41 +0800   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Tue, 21 May 2024 11:41:24 +0800   Thu, 09 May 2024 15:48:31 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 21 May 2024 11:41:24 +0800   Thu, 09 May 2024 15:48:31 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 21 May 2024 11:41:24 +0800   Thu, 09 May 2024 15:48:31 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 21 May 2024 11:41:24 +0800   Tue, 21 May 2024 00:17:52 +0800   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.2.145
  Hostname:    416a100
Capacity:
  cpu:                80
  ephemeral-storage:  459819088Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263739228Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                80
  ephemeral-storage:  447312008456
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263739228Ki
  nvidia.com/gpu:     0
  pods:               110
```

2. `nvidia-smi`:

```
Tue May 21 11:44:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:36:00.0 Off |                    0 |
| N/A   42C    P0             46W /  250W |      13MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:37:00.0 Off |                    0 |
| N/A   42C    P0             45W /  250W |      13MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A6000               Off |   00000000:9D:00.0 Off |                  Off |
| 30%   37C    P8             22W /  300W |      14MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A6000               Off |   00000000:9E:00.0 Off |                  Off |
| 30%   36C    P8             28W /  300W |      14MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2480      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A      2480      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A      2480      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A      2480      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
```
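The `hami.io/node-nvidia-register` annotation above does show both GPUs being registered by HAMi. A rough sketch of how that value can be decoded, assuming the colon-separated, comma-delimited layout visible in the annotation (the field meanings — UUID, split count, memory in MiB, core percent, device type, NUMA node, health flag — are inferred from the values shown, not taken from HAMi documentation):

```python
from dataclasses import dataclass

@dataclass
class RegisteredGPU:
    uuid: str
    split_count: int
    memory_mib: int
    core_percent: int
    gpu_type: str
    numa: int
    healthy: bool

def parse_register(annotation: str) -> list[RegisteredGPU]:
    """Parse a hami.io/node-nvidia-register annotation value.

    Devices are separated by ':'; each device is 7 comma-separated
    fields (layout inferred from the annotation shown above).
    """
    gpus = []
    for entry in annotation.strip().strip(":").split(":"):
        if not entry:
            continue
        uuid, count, mem, cores, gtype, numa, health = entry.split(",")
        gpus.append(RegisteredGPU(uuid, int(count), int(mem), int(cores),
                                  gtype, int(numa), health == "true"))
    return gpus

sample = ("GPU-b7c4eb59-dd76-ca5a-8482-56fd796b0a75,10,40960,100,"
          "NVIDIA-NVIDIA A100-PCIE-40GB,0,true")
print(parse_register(sample)[0].memory_mib)  # 40960
```

If the registered memory and health flag look correct here, the problem is less likely to be on the device-plugin side and more likely in how the `nvidia.com/gpu` resource is surfaced to the node.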

3. `sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi`:

```
Tue May 21 03:44:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:36:00.0 Off |                    0 |
| N/A   42C    P0             46W /  250W |      13MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:37:00.0 Off |                    0 |
| N/A   42C    P0             45W /  250W |      13MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A6000               Off |   00000000:9D:00.0 Off |                  Off |
| 30%   37C    P8             22W /  300W |      14MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A6000               Off |   00000000:9E:00.0 Off |                  Off |
| 30%   35C    P8             27W /  300W |      14MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
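Since the container runtime can clearly see the GPUs, a scheduling test narrows things down further. A minimal test pod sketch, assuming HAMi's default resource names `nvidia.com/gpu` and `nvidia.com/gpumem` (the pod name, image, and limit values here are illustrative, not taken from this issue):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                 # illustrative name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04   # any CUDA-capable image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                        # one vGPU slice
        nvidia.com/gpumem: 4096                  # device memory in MiB
```

If the pod stays Pending, `kubectl describe pod gpu-test` should show why the scheduler rejected it, which helps distinguish a registration problem from a scheduling one.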

4. HAMi pods — `kubectl get pods -n kube-system`:

```
NAME                                      READY   STATUS      RESTARTS   AGE
helm-install-traefik-nzsh4                0/1     Completed   0          11d
svclb-traefik-cpwxf                       2/2     Running     40         11d
metrics-server-7b4f8b595-5kn69            1/1     Running     21         11d
local-path-provisioner-64d457c485-nccpm   1/1     Running     20         11d
coredns-5d69dc75db-q7rxn                  1/1     Running     20         11d
traefik-5dd496474-rxmr2                   1/1     Running     20         11d
nvidia-device-plugin-daemonset-jg762      1/1     Running     0          5m50s
hami-device-plugin-nv5gs                  2/2     Running     0          4m43s
hami-scheduler-757847d79f-n7dbf           2/2     Running     0          4m43s
```

lengrongfu commented 1 month ago

What problem is this issue describing? Could you explain what you expected to happen and what actually went wrong?