Closed jordimassaguerpla closed 5 years ago
The link about setting up cri-o
Hello!
OpenShift uses CRI-O and has a pretty good guide on this that transfers well to vanilla Kubernetes: https://blog.openshift.com/use-gpus-with-device-plugin-in-openshift-3-9/
If you have any errors don't hesitate to ask here :) Closing in the meantime.
Hi. If I look at the kubelet log, I see this:
journalctl -u kubelet
Jul 16 08:13:16 gpu hyperkube[22503]: I0716 08:13:16.795036 22503 nvidia.go:110] NVML initialized. Number of nvidia devices: 1
So my guess is that something is going well here.
But then, if I do
kubectl describe pods | grep nvidia | grep gpu
I get nothing. I would expect to see a node that has GPU resources ... am I assuming wrong?
How can this be debugged? Are there logs for the nvidia plugin that I could look at?
thanks
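For reference, a minimal way to locate the plugin's pods and logs would be something like the following (a sketch; the daemonset name assumes the v1.9 manifest from this repo, and `<pod-name>` is a placeholder):

```shell
# The device plugin runs as a daemonset in the kube-system namespace,
# so its pods won't show up in the default namespace.
kubectl -n kube-system get pods -o wide | grep nvidia-device-plugin

# Tail the logs of one of those pods (replace <pod-name> with a real name)
kubectl -n kube-system logs <pod-name>

# Check whether the node actually advertises the GPU resource
kubectl describe nodes | grep -A 2 'nvidia.com/gpu'
```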
I think I see where the issue may be: I don't have any pod running named nvidia-device-plugin-ctr. However, I couldn't see any error when deploying https://github.com/NVIDIA/k8s-device-plugin/blob/v1.9/nvidia-device-plugin.yml.
Could you tell me where I should look for errors, or how to debug this?
thanks
Hello!
Thanks!
Hi!
First thanks for your quick answer :)
I deployed k8s using SUSE CaaSP. I am working at SUSE and this was my hackweek project actually.
The node is a physical workstation with a nvidia card Geforce GTX 1060. The kubernetes master is running on kvm as a vm on my laptop.
I don't understand which logs. I ran "kubectl create -f ....yaml" and didn't get much. Which logs are you referring to? I looked into the different services using journalctl and didn't see much, but I might have looked for the wrong things... or should I use "kubectl logs"?
What do you need from the node? It has the GPU I mentioned, 12GB of RAM, the disk is an external USB drive, and it has an Intel Xeon. It is a DELL Precision Workstation T3500. Do you need further info?
I know this is very vague, but it would be great if you could give me some hints, especially on which logs to look at and such.
Again, thanks a lot
jordi
I don't know if this is relevant, but here is the output of running "nvidia-container-cli info":
NVRM version: 390.67
CUDA version: 9.1
Device Index: 0
Device Minor: 0
Model: GeForce GTX 1060 3GB
GPU UUID: GPU-f96a76d4-7ba9-07cc-2774-bb7a55ef3e68
Bus Location: 00000000:00.0
Architecture: 6.1
Hello!
Can you provide the logs of the nvidia device plugin?
When you run kubectl create -f ....yaml, it creates pods in the kube-system namespace (one per node). Can you run the kubectl logs ... command?
Can you provide the node description?
Meaning, can you provide the output of kubectl describe nodes?
Hi Renaud,
I wasn't looking at the kube-system namespace ... my fault.
Here is the output of running describe on the daemonset. It looks like the problem is the Pod Security Policy my user has assigned by default, which prevents mounting a host path for security reasons:
kubectl describe ds -n kube-system nvidia-device-plugin-daemonset
Name: nvidia-device-plugin-daemonset
Selector: name=nvidia-device-plugin-ds
Node-Selector:
Labels: name=nvidia-device-plugin-ds
Annotations:
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: name=nvidia-device-plugin-ds
Annotations: scheduler.alpha.kubernetes.io/critical-pod=
Containers:
nvidia-device-plugin-ctr:
Image: nvidia/k8s-device-plugin:1.9
Port:
Host Port:
Environment:
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
Events:
Type Reason Age From Message
Warning FailedCreate 13m (x19 over 35m) daemonset-controller Error creating: pods "nvidia-device-plugin-daemonset-" is forbidden: unable to validate against any pod security policy: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]
So I disabled PodSecurityPolicy and was able to start the containers.
Here is the log of the container:
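An alternative to disabling PodSecurityPolicy entirely would be a dedicated policy that allows just the hostPath volume the plugin needs. A rough sketch (the policy name and spec below are assumptions, not from this thread, and it still has to be granted to the plugin's service account via RBAC):

```shell
# Hypothetical PSP permitting only the hostPath the device plugin mounts.
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: nvidia-device-plugin-psp
spec:
  privileged: false
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - hostPath
  allowedHostPaths:
    - pathPrefix: /var/lib/kubelet/device-plugins
EOF
```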
2018/07/17 11:00:16 Loading NVML
2018/07/17 11:00:16 Failed to initialize NVML: could not load NVML library.
2018/07/17 11:00:16 If this is a GPU node, did you set the docker default runtime to nvidia?
2018/07/17 11:00:16 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/07/17 11:00:16 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
And here is the description of the GPU node:
Name: gpu
Roles:
OutOfDisk False Tue, 17 Jul 2018 13:03:21 +0200 Sun, 15 Jul 2018 15:39:51 +0200 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Tue, 17 Jul 2018 13:03:21 +0200 Tue, 17 Jul 2018 12:51:19 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 17 Jul 2018 13:03:21 +0200 Tue, 17 Jul 2018 12:51:19 +0200 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Tue, 17 Jul 2018 13:03:21 +0200 Tue, 17 Jul 2018 12:59:51 +0200 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.1.195
Hostname: gpu
Capacity:
cpu: 2
memory: 12295404Ki
pods: 110
Allocatable:
cpu: 2
memory: 12193004Ki
pods: 110
System Info:
Machine ID: 259a7be9d5d248a08c6485a952818cbd
System UUID: 44454C4C-4800-1053-8034-B3C04F37354A
Boot ID: 9c9fe62f-4605-4adb-a71d-8f1bb7531971
Kernel Version: 4.4.138-59-default
OS Image: SUSE CaaS Platform 3.0
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.9.13
Kubelet Version: v1.9.8
Kube-Proxy Version: v1.9.8
PodCIDR: 172.16.2.0/23
ExternalID: gpu
Non-terminated Pods: (15 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
default      frontend-67f65745c-g8d64              100m (5%)   0 (0%)  100Mi (0%)  0 (0%)
default      frontend-67f65745c-ppcvm              100m (5%)   0 (0%)  100Mi (0%)  0 (0%)
default      frontend-67f65745c-rt46z              100m (5%)   0 (0%)  100Mi (0%)  0 (0%)
default      nvidia-smi-6                          0 (0%)      0 (0%)  0 (0%)      0 (0%)
default      nvidia-smi-66                         0 (0%)      0 (0%)  0 (0%)      0 (0%)
default      redis-master-585798d8ff-rfx5l         100m (5%)   0 (0%)  100Mi (0%)  0 (0%)
default      redis-slave-865486c9df-gwtmq          100m (5%)   0 (0%)  100Mi (0%)  0 (0%)
default      redis-slave-865486c9df-tvzm7          100m (5%)   0 (0%)  100Mi (0%)  0 (0%)
kube-system  dex-b55d98998-52sxv                   0 (0%)      0 (0%)  0 (0%)      0 (0%)
kube-system  dex-b55d98998-9lx49                   0 (0%)      0 (0%)  0 (0%)      0 (0%)
kube-system  haproxy-gpu                           0 (0%)      0 (0%)  128Mi (1%)  128Mi (1%)
kube-system  kube-dns-7488679ff9-6xmgk             260m (13%)  0 (0%)  110Mi (0%)  170Mi (1%)
kube-system  kube-dns-7488679ff9-s4nt7             260m (13%)  0 (0%)  110Mi (0%)  170Mi (1%)
kube-system  kube-flannel-t6wmr                    0 (0%)      0 (0%)  0 (0%)      0 (0%)
kube-system  nvidia-device-plugin-daemonset-4vttt  0 (0%)      0 (0%)  0 (0%)      0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests  CPU Limits  Memory Requests  Memory Limits
1120m (56%)   0 (0%)      948Mi (7%)       468Mi (3%)
Events:
Type    Reason                   Age              From          Message
Normal  NodeReady                12m (x2 over 2h)   kubelet, gpu  Node gpu status is now: NodeReady
Normal  NodeHasSufficientMemory  12m (x55 over 1h)  kubelet, gpu  Node gpu status is now: NodeHasSufficientMemory
Normal  NodeHasNoDiskPressure    12m (x55 over 1h)  kubelet, gpu  Node gpu status is now: NodeHasNoDiskPressure
Normal  Starting                 3m                 kubelet, gpu  Starting kubelet.
Normal  NodeHasSufficientDisk    3m (x2 over 3m)    kubelet, gpu  Node gpu status is now: NodeHasSufficientDisk
Normal  NodeHasSufficientMemory  3m (x2 over 3m)    kubelet, gpu  Node gpu status is now: NodeHasSufficientMemory
Normal  NodeHasNoDiskPressure    3m (x2 over 3m)    kubelet, gpu  Node gpu status is now: NodeHasNoDiskPressure
Normal  NodeAllocatableEnforced  3m                 kubelet, gpu  Updated Node Allocatable limit across pods
Normal  NodeNotReady             3m                 kubelet, gpu  Node gpu status is now: NodeNotReady
Normal  NodeReady                3m                 kubelet, gpu  Node gpu status is now: NodeReady
The error seems to happen here:
If I understand correctly, this means it cannot load the libnvidia-ml.so.1 library
I don't understand, though, how loading a library within a container has anything to do with having the library installed on the host system. What am I missing?
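For what it's worth, my understanding is that the runtime hook is supposed to bind-mount the host's driver libraries (including libnvidia-ml.so.1) into the container at start, which is why the host install matters; if the hook never runs, the library simply isn't there. A quick, environment-agnostic way to check whether the dynamic linker can see it:

```shell
# Report whether libnvidia-ml is visible to the dynamic linker.
# Run this on the host and inside the container to compare: if the hook
# ran correctly, both should report "found" on a GPU node.
if ldconfig -p 2>/dev/null | grep -q libnvidia-ml; then
  echo "libnvidia-ml: found"
else
  echo "libnvidia-ml: not found"
fi
```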
Hello,
2018/07/17 11:00:16 If this is a GPU node, did you set the docker default runtime to nvidia?
2018/07/17 11:00:16 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/07/17 11:00:16 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
Did you set the docker default runtime to nvidia? Are you using the docker CRI runtime or the containerd runtime?
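For the docker case, the quick-start the log points to amounts to making nvidia the default runtime in docker's daemon configuration, roughly like this (a sketch; the runtime path assumes the usual nvidia-container-runtime packaging):

```shell
# Make the nvidia runtime the default for docker, then restart the daemon.
# The path assumes nvidia-container-runtime is installed in /usr/bin.
cat <<'EOF' > /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
systemctl restart docker
```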
Thanks
I am using the docker-runtime-hook with cri-o, as explained in https://blog.openshift.com/use-gpus-with-device-plugin-in-openshift-3-9/
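For reference, the blog's approach amounts to dropping an OCI hook definition where cri-o looks for hooks, so the nvidia prestart hook runs for every container. A sketch (the path and JSON field names vary between cri-o and hook-schema versions, so treat this as an assumption and check your version's documentation):

```shell
# Hypothetical prestart hook definition for cri-o (hook schema 1.0.0-ish);
# the exact schema differs between versions.
cat <<'EOF' > /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-runtime-hook"
  },
  "when": {
    "always": true
  },
  "stages": ["prestart"]
}
EOF
```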
I have to run "chmod 0666 /dev/nvidia*" every time: on every reboot and after restarting the kubelet. I don't know if it can be related.
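A common way to avoid the manual chmod (an assumption on my side, not something confirmed in this thread) is a udev rule, so the device nodes get the desired mode whenever they are created:

```shell
# Hypothetical udev rule giving the nvidia device nodes mode 0666;
# reload the rules afterwards so it applies without a reboot.
cat <<'EOF' > /etc/udev/rules.d/70-nvidia.rules
KERNEL=="nvidia*", MODE="0666"
EOF
udevadm control --reload-rules
```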
I see in the kubelet logs
Jul 16 08:13:16 gpu hyperkube[22503]: I0716 08:13:16.795036 22503 nvidia.go:110] NVML initialized. Number of nvidia devices: 1
So I think something worked here. But then, describing the node (kubectl describe) does not say anything about nvidia gpus.
Sorry for dropping this issue, @jordimassaguerpla are you still hitting this bug?
Hi, I moved to another task (this was my Hackweek project :) ). @danielorf : is this still relevant to you?
@jordimassaguerpla I was able to eventually work around our problems and get the nvidia-runtime-hook to work. I did seem to find that I could not get annotations to match correctly and had to rely on the CMD matching. I ran out of time to fully investigate though and never made a proper bug.
Thanks, closing. Feel free to reply here if you ever get back to this bug.
Hi! Just a heads up I tried this again (it is again SUSE hackweek :) ). I found that I had to set this value:
user = "root:video"
into /etc/nvidia-container-runtime/config.toml
IIUC the key is the video group
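For reference, the relevant fragment of /etc/nvidia-container-runtime/config.toml would then look like this (only the user value is from this thread; the section name is the file's usual layout, so verify against your installed config):

```shell
# Fragment of /etc/nvidia-container-runtime/config.toml; the key point
# is running the container CLI as root with the video group.
cat <<'EOF'
[nvidia-container-cli]
user = "root:video"
EOF
```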
Then, I was able to run a GPU container with podman:
jordi@gpu:~> sudo podman run nvidia/cuda nvidia-smi
Wed Jun 26 15:34:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26 Driver Version: 430.26 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K2000 Off | 00000000:05:00.0 Off | N/A |
| 30% 46C P8 N/A / N/A | 0MiB / 1998MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
:)
Hi
I am trying to use cri-o with the nvidia-runtime-hook, as explained in (1). However, after creating this daemonset, I run "kubectl describe nodes" and I don't see any mention of nvidia GPUs; plus, the pods that require them are stuck in Pending state.
Have you tried this with cri-o? Do you have instructions on how to make it work? And how can I debug it and get more info?
Thanks
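In case it helps future readers, with cri-o the container state and logs can also be inspected directly on the node via crictl, alongside the kubelet journal (a sketch; `<container-id>` is a placeholder):

```shell
# Find the device-plugin container directly on the node
crictl ps -a | grep nvidia-device-plugin

# Show its logs (replace <container-id> with the id from the previous command)
crictl logs <container-id>

# And check what the kubelet itself logged about the nvidia devices
journalctl -u kubelet | grep -i nvidia
```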