Open somethingwentwell opened 1 year ago
What do the plugin logs look like, and what resources does your node say it has under Capacity and Allocatable when running kubectl get node?
Here is the output of kubectl describe node
kubectl describe node server1
Name: server1
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=server1
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node.kubernetes.io/exclude-from-external-load-balancers=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 192.168.122.148/24
projectcalico.org/IPv4VXLANTunnelAddr: 10.233.79.64
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 28 Nov 2022 09:16:27 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: server1
AcquireTime: <unset>
RenewTime: Mon, 28 Nov 2022 10:22:21 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 28 Nov 2022 09:17:19 +0000   Mon, 28 Nov 2022 09:17:19 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:16:26 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:16:26 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:16:26 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:18:05 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.122.148
Hostname: server1
Capacity:
cpu: 4
ephemeral-storage: 204794888Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 4025584Ki
pods: 110
Allocatable:
cpu: 3800m
ephemeral-storage: 188738968469
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3398896Ki
pods: 110
System Info:
Machine ID: 695b8befa78a443a950c66a055df670a
System UUID: 695b8bef-a78a-443a-950c-66a055df670a
Boot ID: e70f1479-1827-4387-b052-7e9a1a0d7211
Kernel Version: 5.4.0-132-generic
OS Image: Ubuntu 20.04.5 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.10
Kubelet Version: v1.25.4
Kube-Proxy Version: v1.25.4
PodCIDR: 10.233.64.0/24
PodCIDRs: 10.233.64.0/24
Non-terminated Pods: (14 in total)
  Namespace     Name                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------     ----                                      ------------  ----------  ---------------  -------------  ---
  cert-manager  cert-manager-55b8b5b94f-bxxbw             0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  cert-manager  cert-manager-cainjector-655669b754-dd7qr  0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  cert-manager  cert-manager-webhook-77d689b6df-xq25h     0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  kube-system   calico-kube-controllers-d6484b75c-b2d6v   30m (0%)      1 (26%)     64M (1%)         256M (7%)      65m
  kube-system   calico-node-ndjpn                         150m (3%)     300m (7%)   64M (1%)         500M (14%)     65m
  kube-system   coredns-588bb58b94-bhs45                  100m (2%)     0 (0%)      70Mi (2%)        300Mi (9%)     64m
  kube-system   dns-autoscaler-d8bd87bcc-65cdd            20m (0%)      0 (0%)      10Mi (0%)        0 (0%)         64m
  kube-system   kube-apiserver-server1                    250m (6%)     0 (0%)      0 (0%)           0 (0%)         65m
  kube-system   kube-controller-manager-server1           200m (5%)     0 (0%)      0 (0%)           0 (0%)         65m
  kube-system   kube-proxy-d8std                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  kube-system   kube-scheduler-server1                    100m (2%)     0 (0%)      0 (0%)           0 (0%)         65m
  kube-system   local-volume-provisioner-8q86f            0 (0%)        0 (0%)      0 (0%)           0 (0%)         64m
  kube-system   nodelocaldns-xlx8j                        100m (2%)     0 (0%)      70Mi (2%)        200Mi (6%)     64m
  kube-system   nvidia-device-plugin-daemonset-7989w      0 (0%)        0 (0%)      0 (0%)           0 (0%)         54m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests         Limits
  --------           --------         ------
  cpu                950m (25%)       1300m (34%)
  memory             285286400 (8%)   1280288k (36%)
  ephemeral-storage  0 (0%)           0 (0%)
  hugepages-1Gi      0 (0%)           0 (0%)
  hugepages-2Mi      0 (0%)           0 (0%)
Events: <none>
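The node output above lists no nvidia.com/gpu entry under Capacity or Allocatable, which matches the reply that the plugin is not advertising any GPUs. A minimal sketch for checking this from describe output follows. The function name gpu_allocatable is hypothetical, and nvidia.com/gpu is assumed to be the resource name because that is what the plugin config posted later in this thread uses.

```shell
# Hypothetical helper: reads `kubectl describe node` output on stdin and prints
# the nvidia.com/gpu count listed under Allocatable (0 if the entry is absent).
gpu_allocatable() {
  awk '
    /^Capacity:[ \t]*$/    { section = "cap";   next }
    /^Allocatable:[ \t]*$/ { section = "alloc"; next }
    # Any other unindented line ending in ":" starts a new section.
    /^[^ \t].*:[ \t]*$/    { section = "" }
    section == "alloc" && $1 == "nvidia.com/gpu:" { n = $2 }
    END { print n + 0 }
  '
}

# Usage against a live cluster (node name taken from this thread):
#   kubectl describe node server1 | gpu_allocatable
```

A result of 0 only says the kubelet has no GPU resource registered; the plugin logs are still needed to find out why.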
any update?
It seems that the plugin is not advertising any GPUs. Can you post the logs of the plugin?
Hi there! Was the error solved? I am facing the same error and have not been able to solve it. It would be a huge help if you could help me out here.
Was the error solved?
The plugin logs requested in https://github.com/NVIDIA/k8s-device-plugin/issues/348#issuecomment-1369699003 were never supplied. @xlcbingo1999 if you are seeing similar behaviour, please provide a description of your setup as well as the plugin logs.
I0619 14:39:57.345606 1 main.go:178] Starting FS watcher.
I0619 14:39:57.345911 1 main.go:185] Starting OS watcher.
I0619 14:39:57.346248 1 main.go:200] Starting Plugins.
I0619 14:39:57.346272 1 main.go:257] Loading configuration.
I0619 14:39:57.346836 1 main.go:265] Updating config with default resource matching patterns.
I0619 14:39:57.347470 1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0619 14:39:57.347485 1 main.go:279] Retrieving plugins.
W0619 14:39:57.347555 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0619 14:39:57.347606 1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0619 14:39:57.347638 1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0619 14:39:57.347646 1 factory.go:112] Incompatible platform detected
E0619 14:39:57.347650 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0619 14:39:57.347654 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0619 14:39:57.347659 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0619 14:39:57.347663 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0619 14:39:57.347670 1 main.go:308] No devices found. Waiting indefinitely.
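In the log above, the plugin reports that it could not load libnvidia-ml.so.1 inside its container and then waits with no devices. A rough sketch for triaging saved plugin logs by those two messages is shown here. The function name diagnose_plugin_log and its output labels are hypothetical; the matched strings are copied from the log in this thread and may differ in other plugin versions.

```shell
# Hypothetical helper: reads device plugin logs on stdin and prints a short
# label for the first known failure message it recognises.
diagnose_plugin_log() {
  log=$(cat)
  case "$log" in
    *"could not load NVML library"*) echo "nvml-missing" ;;      # driver libraries not visible in the container
    *"No devices found"*)            echo "no-devices" ;;        # plugin started but found nothing to advertise
    *)                               echo "no-known-failure" ;;
  esac
}

# Usage against a live cluster (pod name taken from the node output in this thread):
#   kubectl logs -n kube-system nvidia-device-plugin-daemonset-7989w | diagnose_plugin_log
```

An nvml-missing result points back at the question the plugin itself asks in the log: whether the NVIDIA Container Toolkit is configured as the container runtime on that node.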
1. Issue or feature description
2. Steps to reproduce the issue
The VM runs Ubuntu 20.04.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-20C      On   | 00000000:06:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |   1589MiB / 20475MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-11T02:46:24Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:29:58Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.11.0
commit: d9de4a0
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
oom_score = 0

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[debug]
  level = "info"

[metrics]
  address = ""
  grpc_histogram = false

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "registry.k8s.io/pause:3.7"
    max_container_log_line_size = -1
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      snapshotter = "overlayfs"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          base_runtime_spec = "/etc/containerd/cri-base.json"
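The containerd config excerpt above sets default_runtime_name to "nvidia" but, as posted, shows only a runc entry under runtimes; the excerpt may simply be cut off. A minimal sketch for checking whether a config defines a runtime table for its default runtime follows. The function name default_runtime_defined is hypothetical, it relies on plain text matching and not a TOML parser, and the /etc/containerd/config.toml path in the usage comment is an assumption (the thread does not say where this file lives).

```shell
# Hypothetical helper: reads a containerd config on stdin and reports whether
# a [...containerd.runtimes.<name>] table exists for default_runtime_name.
default_runtime_defined() {
  cfg=$(cat)
  # Pull the quoted value of default_runtime_name, e.g. nvidia.
  name=$(printf '%s\n' "$cfg" |
    sed -n 's/.*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' |
    head -n 1)
  if [ -z "$name" ]; then
    echo "no default_runtime_name set"
  elif printf '%s\n' "$cfg" | grep -Fq ".containerd.runtimes.$name]"; then
    echo "$name: runtime table present"
  else
    echo "$name: runtime table missing"
  fi
}

# Usage (path is an assumption, adjust to your node):
#   default_runtime_defined < /etc/containerd/config.toml
```

Run against only the lines posted above, this would report the nvidia table as missing; the full file on the node may still contain it.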
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
3. Information to attach (optional if deemed irrelevant)
containerd version
KVM config