The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node? **No**
- [x] Are you running Kubernetes v1.13+? **Yes**
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13)? **Yes**
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? **No idea**
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)? **Yes**
We have an OpenShift cluster with NFD and the NVIDIA GPU Operator.
1. Issue or feature description
When I add a new type of GPU node to the cluster (via MachineSets), the NVIDIA operator doesn't schedule the required NVIDIA driver pods on the new machines. I have to manually edit the daemonsets to add a toleration for the new GPU instance type. Are there any specific settings I need to make here?
If I manually add the tolerations to the daemonsets, the pods do get scheduled on the new GPU nodes.
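For reference, the manual workaround amounts to adding something like the following toleration under the pod template of each operator daemonset (a sketch only; the taint key, value, and effect are assumptions and must match whatever taint your MachineSet places on the new GPU nodes):

```yaml
# Hypothetical toleration added under spec.template.spec of the driver
# daemonset; key/operator/effect must match the taint on the new GPU nodes.
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```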
2. Steps to reproduce the issue
1. Add new GPU nodes via a MachineSet.
2. Check the pods on the new GPU machines.
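To reproduce, the new MachineSet's machine template carries a taint along these lines (an illustrative sketch; the taint key, value, and effect are assumptions for this example):

```yaml
# Sketch of the taint section of an OpenShift MachineSet's machine template
# (key, value, and effect are illustrative assumptions)
spec:
  template:
    spec:
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
```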
3. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: `kubectl get pods --all-namespaces`
- [ ] kubernetes daemonset status: `kubectl get ds --all-namespaces`
- [ ] If a pod/ds is in an error or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
- [ ] If a pod/ds is in an error or pending state: `kubectl logs -n NAMESPACE POD_NAME`
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [ ] NVIDIA shared directory: `ls -la /run/nvidia`
- [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
- [ ] kubelet logs: `journalctl -u kubelet > kubelet.logs`