```
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                   On |
| N/A   33C    P0    51W / 400W |     20MiB / 81251MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:10:00.0 Off |                   On |
```
@sricharandevops can you triple quote (```) all the console output blocks? It will be easier to read. Could you also share the node labels and the logs of the mig-manager pod?
@sricharandevops Can you attach /var/log/messages so we can debug any errors with the driver? Also, I see multiple restarts of the GFD and device-plugin pods. Did you apply MIG mode through MIG Manager?
@sricharandevops This is due to the nvidia-fabric-manager service not running. We currently don't support starting this service within the CentOS driver image, only for RHCOS and Ubuntu 20.04. In this case, you would need to pre-install the NVIDIA drivers on the node directly and start the nvidia-fabric-manager service through systemd. When installing the gpu-operator, please pass `--set driver.enabled=false` so that the driver container is not created.
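A minimal sketch of such an install, assuming the nvidia Helm repo is already added; the namespace below is only an example, use whatever namespace you deploy the operator into:

```
# Driver is pre-installed on the host, so skip the containerized driver
helm install gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false \
  -n gpu-operator-resources --create-namespace
```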
Below are the node labels:
Name: hgxlearn1000-mgmt.localdomain Roles: control-plane,master Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux feature.node.kubernetes.io/cpu-cpuid.ADX=true feature.node.kubernetes.io/cpu-cpuid.AESNI=true feature.node.kubernetes.io/cpu-cpuid.AVX=true feature.node.kubernetes.io/cpu-cpuid.AVX2=true feature.node.kubernetes.io/cpu-cpuid.FMA3=true feature.node.kubernetes.io/cpu-cpuid.IBS=true feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT=true feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM=true feature.node.kubernetes.io/cpu-cpuid.IBSFFV=true feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT=true feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT=true feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM=true feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT=true feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK=true feature.node.kubernetes.io/cpu-cpuid.SHA=true feature.node.kubernetes.io/cpu-cpuid.SSE4=true feature.node.kubernetes.io/cpu-cpuid.SSE42=true feature.node.kubernetes.io/cpu-cpuid.SSE4A=true feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true feature.node.kubernetes.io/cpu-hardware_multithreading=true feature.node.kubernetes.io/cpu-rdt.RDTCMT=true feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true feature.node.kubernetes.io/cpu-rdt.RDTMBM=true feature.node.kubernetes.io/cpu-rdt.RDTMON=true feature.node.kubernetes.io/custom-rdma.available=true feature.node.kubernetes.io/custom-rdma.capable=true feature.node.kubernetes.io/kernel-config.NO_HZ=true feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true feature.node.kubernetes.io/kernel-version.full=3.10.0-1160.45.1.el7.x86_64 feature.node.kubernetes.io/kernel-version.major=3 feature.node.kubernetes.io/kernel-version.minor=10 feature.node.kubernetes.io/kernel-version.revision=0 feature.node.kubernetes.io/memory-numa=true feature.node.kubernetes.io/network-sriov.capable=true feature.node.kubernetes.io/pci-10de.present=true feature.node.kubernetes.io/pci-10de.sriov.capable=true feature.node.kubernetes.io/pci-1a03.present=true feature.node.kubernetes.io/storage-nonrotationaldisk=true feature.node.kubernetes.io/system-os_release.ID=centos feature.node.kubernetes.io/system-os_release.VERSION_ID=7 feature.node.kubernetes.io/system-os_release.VERSION_ID.major=7 kubernetes.io/arch=amd64 kubernetes.io/hostname=hgxlearn1000-mgmt.localdomain kubernetes.io/os=linux node-role.kubernetes.io/control-plane= node-role.kubernetes.io/master= node.kubernetes.io/exclude-from-external-load-balancers= nvidia.com/cuda.driver.major=470 nvidia.com/cuda.driver.minor=57 nvidia.com/gfd.timestamp=1638005929 nvidia.com/gpu.compute.major=8 nvidia.com/gpu.compute.minor=0 nvidia.com/gpu.count=8 nvidia.com/gpu.deploy.container-toolkit=true nvidia.com/gpu.deploy.dcgm=true nvidia.com/gpu.deploy.dcgm-exporter=true nvidia.com/gpu.deploy.device-plugin=true nvidia.com/gpu.deploy.driver=true nvidia.com/gpu.deploy.gpu-feature-discovery=true nvidia.com/gpu.deploy.mig-manager=true nvidia.com/gpu.deploy.node-status-exporter=true nvidia.com/gpu.deploy.operator-validator=true nvidia.com/gpu.family=ampere nvidia.com/gpu.machine=G492-ZD2-00 nvidia.com/gpu.memory=81251 nvidia.com/gpu.present=true nvidia.com/gpu.product=NVIDIA-A100-SXM-80GB nvidia.com/mig-1g.10gb.count=8 nvidia.com/mig-1g.10gb.engines.copy=1 nvidia.com/mig-1g.10gb.engines.decoder=0 nvidia.com/mig-1g.10gb.engines.encoder=0 nvidia.com/mig-1g.10gb.engines.jpeg=0 nvidia.com/mig-1g.10gb.engines.ofa=0 nvidia.com/mig-1g.10gb.memory=9728 nvidia.com/mig-1g.10gb.multiprocessors=14 nvidia.com/mig-1g.10gb.slices.ci=1 
nvidia.com/mig-1g.10gb.slices.gi=1 nvidia.com/mig-2g.20gb.count=8 nvidia.com/mig-2g.20gb.engines.copy=2 nvidia.com/mig-2g.20gb.engines.decoder=1 nvidia.com/mig-2g.20gb.engines.encoder=0 nvidia.com/mig-2g.20gb.engines.jpeg=0 nvidia.com/mig-2g.20gb.engines.ofa=0 nvidia.com/mig-2g.20gb.memory=19968 nvidia.com/mig-2g.20gb.multiprocessors=28 nvidia.com/mig-2g.20gb.slices.ci=2 nvidia.com/mig-2g.20gb.slices.gi=2 nvidia.com/mig-3g.40gb.count=8 nvidia.com/mig-3g.40gb.engines.copy=3 nvidia.com/mig-3g.40gb.engines.decoder=2 nvidia.com/mig-3g.40gb.engines.encoder=0 nvidia.com/mig-3g.40gb.engines.jpeg=0 nvidia.com/mig-3g.40gb.engines.ofa=0 nvidia.com/mig-3g.40gb.memory=40448 nvidia.com/mig-3g.40gb.multiprocessors=42 nvidia.com/mig-3g.40gb.slices.ci=3 nvidia.com/mig-3g.40gb.slices.gi=3 nvidia.com/mig.strategy=mixed robin.io/domain=ROBIN robin.io/hostname=hgxlearn1000-mgmt.localdomain robin.io/nodetype=robin-node robin.io/rnodetype=robin-master-node robin.io/robinhost=hgxlearn1000-mgmt Annotations: csi.volume.kubernetes.io/nodeid: {"robin":"hgxlearn1000-mgmt.localdomain"} kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock nfd.node.kubernetes.io/extended-resources: nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.IBS,cpu-cpuid.IBSBRNTRGT,cpu-cpuid.IBSFETCHSAM,cpu-cpu... nfd.node.kubernetes.io/master.version: v0.8.2 nfd.node.kubernetes.io/worker.version: v0.8.2 node.alpha.kubernetes.io/ttl: 0 projectcalico.org/IPv4Address: 198.18.196.151/20 projectcalico.org/IPv4IPIPTunnelAddr: 172.21.93.128 volumes.kubernetes.io/controller-managed-attach-detach: true
> you would need to pre-install NVIDIA drivers on the node directly and start nvidia-fabric-manager services through systemd
Thank you @shivamerla for your reply. It would be helpful if you could please provide the steps to install the NVIDIA drivers on CentOS.
Here is the log from the mig-manager pod:

```
[root@hgxlearn1000-mgmt ~]# kubectl logs nvidia-mig-manager-jzwkp -n gpu-operator-resources
W1127 09:40:29.666281       1 client_config.go:615] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2021-11-27T09:40:29Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
```
I enabled MIG mode using `nvidia-smi -mig 1` and partitioned the GPUs with `nvidia-smi mig -cgi 9,14,19 -C`.
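As a sanity check (not part of the original commands), the resulting layout can be inspected with the standard nvidia-smi MIG listing subcommands:

```
# List the GPU instances and compute instances created by the -cgi/-C command
nvidia-smi mig -lgi
nvidia-smi mig -lci

# List all devices, including the individual MIG devices
nvidia-smi -L
```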
@sricharandevops you can download the driver from here: https://www.nvidia.in/Download/driverResults.aspx/182647/en-in and install it using the steps below.
Blacklist nouveau if it is installed; otherwise skip this step:

```
$ cat << EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
$ sudo dracut --force
```
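Optionally, after rebuilding the initramfs and rebooting, a quick check (my addition, not from the original steps) confirms nouveau is gone:

```
# No output expected once nouveau is blacklisted and the node has rebooted
lsmod | grep -i nouveau
```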
Download and install the driver:

```
$ wget https://us.download.nvidia.com/tesla/470.82.01/NVIDIA-Linux-x86_64-470.82.01.run
$ sh NVIDIA-Linux-x86_64-470.82.01.run -q -a -n -X -s
```
Verify the module is loaded:

```
$ modinfo -F version nvidia
470.82.01
```
@shivamerla, thanks for your response. Below is the console output. Is this warning expected, or is something wrong here?
@shivamerla, how do I start the fabric manager service? I don't find that service on the host:

```
[root@hgxlearn1000-mgmt ~]# sudo systemctl status nvidia-fabricmanager
Unit nvidia-fabricmanager.service could not be found.
```
@sricharandevops Sorry, you would need to install those packages as well. Also, you can ignore the warnings during the driver install.
```
sudo dnf module enable nvidia-driver:470/fm
sudo dnf module install nvidia-driver:470/fm
```

https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
@shivamerla, I still don't see the nvidia-fabricmanager service.
Ah, I need to check this. Please download the packages manually and install them:
```
curl -fSsl -O https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/nvidia-fabric-manager-470.82.01-1.x86_64.rpm
curl -fSsl -O https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/libnvidia-nscq-470-470.82.01-1.x86_64.rpm
dnf localinstall -y nvidia-fabric-manager-470.82.01-1.x86_64.rpm libnvidia-nscq-470-470.82.01-1.x86_64.rpm
```
Then start the fabric manager service.
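A minimal sketch of that step, assuming the RPMs above install the nvidia-fabricmanager systemd unit referenced earlier in this thread:

```
sudo systemctl enable nvidia-fabricmanager
sudo systemctl start nvidia-fabricmanager

# Expect "active (running)" in the output
sudo systemctl status nvidia-fabricmanager
```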
@shivamerla, after starting the fabric manager and reinstalling the gpu-operator, the pods are stuck in Init and the operator-validator pod is in CrashLoopBackOff.

```
helm install gpu-operator nvidia/gpu-operator --set driver.enabled=false -n robinio
kubectl describe pod nvidia-operator-validator-s9svh -n gpu-operator-resources
```
```
Events:
  Type     Reason     Age                   From               Message
  Normal   Scheduled  23m                   default-scheduler  Successfully assigned gpu-operator-resources/nvidia-operator-validator-s9svh to hgxlearn1000-mgmt.localdomain
  Normal   Pulled     23m                   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
  Normal   Created    23m                   kubelet            Created container driver-validation
  Normal   Started    23m                   kubelet            Started container driver-validation
  Normal   Pulled     21m (x5 over 23m)     kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
  Normal   Created    21m (x5 over 23m)     kubelet            Created container toolkit-validation
  Warning  Failed     21m (x5 over 23m)     kubelet            Error: Exception calling application: ErrorUnknown:StatusCode.UNKNOWN:failed to start container "toolkit-validation": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\"": unknown
  Warning  BackOff    3m15s (x90 over 22m)  kubelet            Back-off restarting failed container
```
@sricharandevops You also need to specify `--set toolkit.version=1.7.2-centos7` during install.
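Putting the flags from this thread together, the reinstall would look roughly like this (a sketch; release name and namespace are taken from the earlier helm command):

```
helm install gpu-operator nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.version=1.7.2-centos7 \
  -n robinio
```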
@cdesiniotis, I tried installing the gpu-operator with `--set toolkit.version=1.7.2-centos7`:
```
[root@hgxlearn1000-mgmt ~]# kubectl logs nvidia-cuda-validator-nr8cq -n gpu-operator-resources
cuda workload validation is successful
[root@hgxlearn1000-mgmt ~]# kubectl logs nvidia-device-plugin-daemonset-br5xf -n gpu-operator-resources
2021/11/29 20:40:05 Loading NVML
2021/11/29 20:40:05 Starting FS watcher.
2021/11/29 20:40:05 Starting OS watcher.
2021/11/29 20:40:05 Retreiving plugins.
2021/11/29 20:40:05 Shutdown of NVML returned:
goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae1200, 0x1042638)
        /go/src/nvidia-device-plugin/mig-strategy.go:124 +0x7cb
main.start(0xc4202f8e80, 0x0, 0x0)
        /go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc4202cef00, 0xae5a80, 0xc4202f0010, 0xc4202e4190, 0x1, 0x1, 0x0, 0x0)
        /go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc4202cef00, 0xc4202e4190, 0x1, 0x1, 0x456810, 0xc420363f50)
        /go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
        /go/src/nvidia-device-plugin/main.go:88 +0x751
```
@sricharandevops Please edit the clusterpolicy (`kubectl edit clusterpolicy`) and change `mig.strategy` to `mixed`; you seem to have different MIG partitions set up across devices. By the way, for MIG configuration please use the MIG Manager functionality: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-operator-mig.html#gpu-operator-with-mig
Support for MIG Manager with pre-installed drivers will be added in v1.9.0 (you can try out the v1.9.0-beta helm charts that are already published).
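A non-interactive equivalent of that edit would be something like the following (a sketch; it assumes the operator's default ClusterPolicy resource name, cluster-policy):

```
# Switch the device plugin / GFD to the mixed MIG strategy
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec": {"mig": {"strategy": "mixed"}}}'
```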
@shivamerla, do we have support for CentOS with 1.9.0? I am not able to install the 1.9 version. Please suggest.
@sricharandevops please delete the old CRD (`kubectl delete crd clusterpolicies.nvidia.com`) and try again. Also pass `--set mig.strategy=mixed` to match your system config.
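As a concrete sequence, that advice translates to roughly the following (a sketch; the exact 1.9.0-beta chart version string is not given in this thread, so the --version value is a placeholder):

```
# Remove the CRD left behind by the previous release
kubectl delete crd clusterpolicies.nvidia.com

# Reinstall with the MIG strategy matching the mixed partitioning on this node
helm install gpu-operator nvidia/gpu-operator \
  --version <1.9.0-beta-chart-version> \
  --set driver.enabled=false \
  --set toolkit.version=1.7.2-centos7 \
  --set mig.strategy=mixed \
  -n robinio
```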
I tried to install using the v1.9.0-beta chart.
kubectl describe pod gpu-feature-discovery-hhcqw -n robinio

```
..
Events:
  Type     Reason                  Age    From               Message
  Normal   Scheduled               4m37s  default-scheduler  Successfully assigned robinio/gpu-feature-discovery-hhcqw to hgxlearn1000-mgmt.localdomain
  Warning  FailedCreatePodSandBox  4m36s  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = Exception calling application: ErrorUnknown:StatusCode.UNKNOWN:RuntimeHandler "nvidia" not supported
```
I see container-toolkit is still initializing. After the toolkit is set up, the rest of the pods should start up.
Just curious, I have waited for more than 30 minutes; does it usually take this long to bring up the toolkit?
No, it should be quick once the image is pulled. Please share the output of `kubectl logs <toolkit-pod> -n robinio -c driver-validation` and `kubectl describe pod <toolkit-pod> -n robinio`.
@sricharandevops were you able to get this working?
Sorry for the delayed response! I was able to get things working with the 1.8 GPU operator, with the NVIDIA drivers and the fabric-manager service running on the host and the rest of the GPU operator components running inside containers.
@shivamerla, can we have HGX A100 (fabric manager) and other GPU servers in the same Kubernetes cluster? In that case, how can I selectively have some nodes take the driver from the host and others from the nvidia-driver-daemonset?
1. Quick Debug Checklist

- Are `i2c_core` and `ipmi_msghandler` loaded on the nodes? yes
- Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)? yes

1. Issue or feature description
I have installed the NVIDIA GPU operator, image version 1.8.2, with driver image version 470.57, on Kubernetes 1.21 on a bare-metal CentOS server with 8 A100 MIG GPU cards. After installing the GPU operator using Helm, the nvidia-cuda-validator and nvidia-operator-validator pods get stuck in Init/CrashLoopBackOff state forever. nvidia-smi works fine, but my GPU workload test fails to detect the GPU.
2. Steps to reproduce the issue:

Deploy the gpu-operator Helm chart; after some time, both of the above pods get into CrashLoopBackOff state.
3. Additional Info
Can anyone help me resolve this issue?