NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

nvidia-cuda-validator-xxx and nvidia-operator-validator-xxx pods are failing with CrashLoopBackOff on a CentOS 7 bare-metal A100 HGX system with fabric manager (NVSwitch). #286

Closed · sricharandevops closed this 2 years ago

sricharandevops commented 2 years ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

1. Issue or feature description

I have installed the NVIDIA GPU Operator (image version 1.8.2, driver image version 470.57) on Kubernetes 1.21, on a bare-metal CentOS server with 8 A100 MIG-capable GPUs. After installing the GPU Operator via Helm, the nvidia-cuda-validator and nvidia-operator-validator pods stay in Init/CrashLoopBackOff state forever. nvidia-smi works fine, but my GPU workload test fails to detect a GPU.

(screenshot)

```
kubectl logs nvidia-cuda-validator-k6tbm -n gpu-operator-resources -c cuda-validation
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code system not yet initialized)!
```

2. Steps to reproduce the issue:

Deploy the gpu-operator Helm chart; after some time, both of the above pods get into CrashLoopBackOff state.

3. Additional Info

 - [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
 - foo
 - [ ] Docker configuration file: `cat /etc/docker/daemon.json`
```
[root@hgxlearn1000-mgmt ~]# cat /etc/docker/daemon.json
{
    "bip": "172.17.0.1/16",
    "default-runtime": "nvidia",
    "experimental": true,
    "fixed-cidr": "172.17.0.0/16",
    "live-restore": true,
    "log-driver": "json-file",
    "log-opts": {
        "max-file": "5",
        "max-size": "50m"
    },
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
        },
        "nvidia-experimental": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
        }
    },
```

4. Contents of `/usr/local/nvidia/toolkit` and `/run/nvidia`:

```
[root@hgxlearn1000-mgmt ~]# ls -la /usr/local/nvidia/toolkit
total 8548
drwxr-xr-x 3 root root    4096 Nov 26 01:39 .
drwxr-xr-x 3 root root      21 Nov 26 01:39 ..
drwxr-xr-x 3 root root      38 Nov 26 01:39 .config
lrwxrwxrwx 1 root root      28 Nov 26 01:39 libnvidia-container.so.1 -> libnvidia-container.so.1.5.1
-rwxr-xr-x 1 root root  179192 Nov 26 01:39 libnvidia-container.so.1.5.1
-rwxr-xr-x 1 root root     154 Nov 26 01:39 nvidia-container-cli
-rwxr-xr-x 1 root root   43024 Nov 26 01:39 nvidia-container-cli.real
-rwxr-xr-x 1 root root     342 Nov 26 01:39 nvidia-container-runtime
-rwxr-xr-x 1 root root     414 Nov 26 01:39 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root 3991000 Nov 26 01:39 nvidia-container-runtime.experimental
lrwxrwxrwx 1 root root      24 Nov 26 01:39 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x 1 root root 2359384 Nov 26 01:39 nvidia-container-runtime.real
-rwxr-xr-x 1 root root     198 Nov 26 01:39 nvidia-container-toolkit
-rwxr-xr-x 1 root root 2147896 Nov 26 01:39 nvidia-container-toolkit.real
```

```
[root@hgxlearn1000-mgmt ~]# ls -la /run/nvidia
total 8
drwxr-xr-x  4 root root  120 Nov 26 01:39 .
drwxr-xr-x 38 root root 1480 Nov 25 20:55 ..
drwxr-xr-x  1 root root  134 Nov 26 01:38 driver
-rw-r--r--  1 root root    7 Nov 26 01:38 nvidia-driver.pid
-rw-r--r--  1 root root    7 Nov 26 01:39 toolkit.pid
drwxr-xr-x  2 root root   80 Nov 26 01:39 validations
```

Can anyone help me resolve this issue?

sricharandevops commented 2 years ago

```
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                   On |
| N/A   33C    P0    51W / 400W |     20MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:10:00.0 Off |                   On |
```

kpouget commented 2 years ago

@sricharandevops can you triple-quote (```) all the console output blocks? It will be easier to read,

and maybe share the node labels,

and the logs of the mig-manager Pod

shivamerla commented 2 years ago

@sricharandevops Can you attach /var/log/messages so we can debug any errors with the driver? Also, I see multiple restarts of the GFD and Device-Plugin pods. Did you apply MIG mode through MIG Manager?

shivamerla commented 2 years ago

@sricharandevops This is due to the nvidia-fabric-manager service not running. We currently don't support starting this service from within the CentOS driver image, only for RHCOS and Ubuntu 20.04. In this case, you would need to pre-install the NVIDIA driver on the node directly and start the nvidia-fabric-manager service through systemd. When installing the gpu-operator, please pass `--set driver.enabled=false` so that the driver container is not created.
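
For reference, such an install might look like the sketch below (the release name and namespace here are illustrative, not prescriptive):

```sh
# Sketch: install the GPU Operator without the driver container, assuming the
# NVIDIA driver and nvidia-fabric-manager are already installed and running on the host.
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --set driver.enabled=false
```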

sricharandevops commented 2 years ago

Below are the node labels and annotations:

```
Name:        hgxlearn1000-mgmt.localdomain
Roles:       control-plane,master
Labels:      beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux
             feature.node.kubernetes.io/cpu-cpuid.ADX=true feature.node.kubernetes.io/cpu-cpuid.AESNI=true feature.node.kubernetes.io/cpu-cpuid.AVX=true feature.node.kubernetes.io/cpu-cpuid.AVX2=true feature.node.kubernetes.io/cpu-cpuid.FMA3=true
             feature.node.kubernetes.io/cpu-cpuid.IBS=true feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT=true feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM=true feature.node.kubernetes.io/cpu-cpuid.IBSFFV=true feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT=true feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT=true feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM=true feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT=true feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK=true
             feature.node.kubernetes.io/cpu-cpuid.SHA=true feature.node.kubernetes.io/cpu-cpuid.SSE4=true feature.node.kubernetes.io/cpu-cpuid.SSE42=true feature.node.kubernetes.io/cpu-cpuid.SSE4A=true feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true feature.node.kubernetes.io/cpu-hardware_multithreading=true
             feature.node.kubernetes.io/cpu-rdt.RDTCMT=true feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true feature.node.kubernetes.io/cpu-rdt.RDTMBM=true feature.node.kubernetes.io/cpu-rdt.RDTMON=true feature.node.kubernetes.io/custom-rdma.available=true feature.node.kubernetes.io/custom-rdma.capable=true
             feature.node.kubernetes.io/kernel-config.NO_HZ=true feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true feature.node.kubernetes.io/kernel-version.full=3.10.0-1160.45.1.el7.x86_64 feature.node.kubernetes.io/kernel-version.major=3 feature.node.kubernetes.io/kernel-version.minor=10 feature.node.kubernetes.io/kernel-version.revision=0
             feature.node.kubernetes.io/memory-numa=true feature.node.kubernetes.io/network-sriov.capable=true feature.node.kubernetes.io/pci-10de.present=true feature.node.kubernetes.io/pci-10de.sriov.capable=true feature.node.kubernetes.io/pci-1a03.present=true feature.node.kubernetes.io/storage-nonrotationaldisk=true
             feature.node.kubernetes.io/system-os_release.ID=centos feature.node.kubernetes.io/system-os_release.VERSION_ID=7 feature.node.kubernetes.io/system-os_release.VERSION_ID.major=7
             kubernetes.io/arch=amd64 kubernetes.io/hostname=hgxlearn1000-mgmt.localdomain kubernetes.io/os=linux
             node-role.kubernetes.io/control-plane= node-role.kubernetes.io/master= node.kubernetes.io/exclude-from-external-load-balancers=
             nvidia.com/cuda.driver.major=470 nvidia.com/cuda.driver.minor=57 nvidia.com/gfd.timestamp=1638005929 nvidia.com/gpu.compute.major=8 nvidia.com/gpu.compute.minor=0 nvidia.com/gpu.count=8
             nvidia.com/gpu.deploy.container-toolkit=true nvidia.com/gpu.deploy.dcgm=true nvidia.com/gpu.deploy.dcgm-exporter=true nvidia.com/gpu.deploy.device-plugin=true nvidia.com/gpu.deploy.driver=true nvidia.com/gpu.deploy.gpu-feature-discovery=true nvidia.com/gpu.deploy.mig-manager=true nvidia.com/gpu.deploy.node-status-exporter=true nvidia.com/gpu.deploy.operator-validator=true
             nvidia.com/gpu.family=ampere nvidia.com/gpu.machine=G492-ZD2-00 nvidia.com/gpu.memory=81251 nvidia.com/gpu.present=true nvidia.com/gpu.product=NVIDIA-A100-SXM-80GB
             nvidia.com/mig-1g.10gb.count=8 nvidia.com/mig-1g.10gb.engines.copy=1 nvidia.com/mig-1g.10gb.engines.decoder=0 nvidia.com/mig-1g.10gb.engines.encoder=0 nvidia.com/mig-1g.10gb.engines.jpeg=0 nvidia.com/mig-1g.10gb.engines.ofa=0 nvidia.com/mig-1g.10gb.memory=9728 nvidia.com/mig-1g.10gb.multiprocessors=14 nvidia.com/mig-1g.10gb.slices.ci=1 nvidia.com/mig-1g.10gb.slices.gi=1
             nvidia.com/mig-2g.20gb.count=8 nvidia.com/mig-2g.20gb.engines.copy=2 nvidia.com/mig-2g.20gb.engines.decoder=1 nvidia.com/mig-2g.20gb.engines.encoder=0 nvidia.com/mig-2g.20gb.engines.jpeg=0 nvidia.com/mig-2g.20gb.engines.ofa=0 nvidia.com/mig-2g.20gb.memory=19968 nvidia.com/mig-2g.20gb.multiprocessors=28 nvidia.com/mig-2g.20gb.slices.ci=2 nvidia.com/mig-2g.20gb.slices.gi=2
             nvidia.com/mig-3g.40gb.count=8 nvidia.com/mig-3g.40gb.engines.copy=3 nvidia.com/mig-3g.40gb.engines.decoder=2 nvidia.com/mig-3g.40gb.engines.encoder=0 nvidia.com/mig-3g.40gb.engines.jpeg=0 nvidia.com/mig-3g.40gb.engines.ofa=0 nvidia.com/mig-3g.40gb.memory=40448 nvidia.com/mig-3g.40gb.multiprocessors=42 nvidia.com/mig-3g.40gb.slices.ci=3 nvidia.com/mig-3g.40gb.slices.gi=3
             nvidia.com/mig.strategy=mixed
             robin.io/domain=ROBIN robin.io/hostname=hgxlearn1000-mgmt.localdomain robin.io/nodetype=robin-node robin.io/rnodetype=robin-master-node robin.io/robinhost=hgxlearn1000-mgmt
Annotations: csi.volume.kubernetes.io/nodeid: {"robin":"hgxlearn1000-mgmt.localdomain"}
             kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
             nfd.node.kubernetes.io/extended-resources:
             nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.IBS,cpu-cpuid.IBSBRNTRGT,cpu-cpuid.IBSFETCHSAM,cpu-cpu...
             nfd.node.kubernetes.io/master.version: v0.8.2
             nfd.node.kubernetes.io/worker.version: v0.8.2
             node.alpha.kubernetes.io/ttl: 0
             projectcalico.org/IPv4Address: 198.18.196.151/20
             projectcalico.org/IPv4IPIPTunnelAddr: 172.21.93.128
             volumes.kubernetes.io/controller-managed-attach-detach: true
```

sricharandevops commented 2 years ago

> you would need to pre-install NVIDIA drivers on the node directly and start nvidia-fabric-manager services through systemd

Thank you @shivamerla for your reply. It would be helpful if you could please provide the steps to install the NVIDIA driver on CentOS.

sricharandevops commented 2 years ago

Here is the log for the mig-manager pod:

```
[root@hgxlearn1000-mgmt ~]# kubectl logs nvidia-mig-manager-jzwkp -n gpu-operator-resources
W1127 09:40:29.666281       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2021-11-27T09:40:29Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
```

I have enabled MIG mode using:

```
nvidia-smi -mig 1
```

and partitioned the MIG devices with:

```
nvidia-smi mig -cgi 9,14,19 -C
```
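
For reference, the resulting GPU instances and compute instances can be listed afterwards as a quick sanity check (standard nvidia-smi MIG queries):

```sh
# List the GPU instances and compute instances created by the commands above
nvidia-smi mig -lgi
nvidia-smi mig -lci
```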

shivamerla commented 2 years ago

@sricharandevops you can download the driver from here: https://www.nvidia.in/Download/driverResults.aspx/182647/en-in and install it using the steps below:

Blacklist nouveau if it is installed (otherwise skip this step):

```
$ cat << EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

$ sudo dracut --force
```

Download and install driver:

```
$ wget https://us.download.nvidia.com/tesla/470.82.01/NVIDIA-Linux-x86_64-470.82.01.run

$ sh NVIDIA-Linux-x86_64-470.82.01.run -q -a -n -X -s
```

Verify modules are loaded:

```
$ modinfo -F version nvidia
470.82.01
```
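
As an additional sanity check after the install (not part of the original steps), the loaded kernel module and visible GPUs can also be confirmed:

```sh
# Confirm the nvidia kernel module is loaded and the GPUs enumerate
lsmod | grep nvidia
nvidia-smi -L
```
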
sricharandevops commented 2 years ago

@shivamerla , Thanks for your response,

Below is the console output. Is this warning expected, or is something wrong here?

(screenshot)

sricharandevops commented 2 years ago

@shivamerla ,

How do I start the fabric-manager service? I don't find that service on the host.

```
[root@hgxlearn1000-mgmt ~]# sudo systemctl status nvidia-fabricmanager
Unit nvidia-fabricmanager.service could not be found.
[root@hgxlearn1000-mgmt ~]#
```

shivamerla commented 2 years ago

@sricharandevops Sorry, you would need to install those packages as well. Also, you can ignore the warnings during the driver install.

```
sudo dnf module enable nvidia-driver:470/fm
sudo dnf module install nvidia-driver:470/fm
```

https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

sricharandevops commented 2 years ago

(screenshot)

sricharandevops commented 2 years ago

@shivamerla, I still don't see the nvidia-fabricmanager service.

shivamerla commented 2 years ago

Ah, I need to check this. Please download the packages manually and install them:

```
curl -fSsl -O https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/nvidia-fabric-manager-470.82.01-1.x86_64.rpm
curl -fSsl -O https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/libnvidia-nscq-470-470.82.01-1.x86_64.rpm

dnf localinstall -y nvidia-fabric-manager-470.82.01-1.x86_64.rpm libnvidia-nscq-470-470.82.01-1.x86_64.rpm
```

Then start the fabric-manager service.
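
A sketch of starting and verifying the service afterwards (standard systemd commands; the unit name matches the output shown earlier in this thread):

```sh
# Enable and start NVIDIA Fabric Manager, then confirm it is active
sudo systemctl enable nvidia-fabricmanager
sudo systemctl start nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager
```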

sricharandevops commented 2 years ago

@shivamerla , after starting up the fabric manager,

(screenshot)

and reinstalling the gpu-operator, the pods are stuck in Init and the operator-validator pod is in CrashLoopBackOff.

```
helm install gpu-operator nvidia/gpu-operator --set driver.enabled=false -n robinio
```

(screenshot)

```
kubectl describe pod nvidia-operator-validator-s9svh -n gpu-operator-resources

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  23m                   default-scheduler  Successfully assigned gpu-operator-resources/nvidia-operator-validator-s9svh to hgxlearn1000-mgmt.localdomain
  Normal   Pulled     23m                   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
  Normal   Created    23m                   kubelet            Created container driver-validation
  Normal   Started    23m                   kubelet            Started container driver-validation
  Normal   Pulled     21m (x5 over 23m)     kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
  Normal   Created    21m (x5 over 23m)     kubelet            Created container toolkit-validation
  Warning  Failed     21m (x5 over 23m)     kubelet            Error: Exception calling application: ErrorUnknown:StatusCode.UNKNOWN:failed to start container "toolkit-validation": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\n\\"\"": unknown
  Warning  BackOff    3m15s (x90 over 22m)  kubelet            Back-off restarting failed container
```


cdesiniotis commented 2 years ago

@sricharandevops You also need to specify --set toolkit.version=1.7.2-centos7 during install.
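
For reference, the combined install might then look roughly like this (a sketch reusing the flags and namespace from earlier in this thread):

```sh
# Sketch: v1.8.x install on CentOS 7 with a pre-installed host driver
helm install gpu-operator nvidia/gpu-operator \
  -n robinio \
  --set driver.enabled=false \
  --set toolkit.version=1.7.2-centos7
```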

sricharandevops commented 2 years ago

@cdesiniotis, I tried installing the gpu-operator with --set toolkit.version=1.7.2-centos7

(screenshot)

```
[root@hgxlearn1000-mgmt ~]# kubectl logs nvidia-cuda-validator-nr8cq -n gpu-operator-resources
cuda workload validation is successful

[root@hgxlearn1000-mgmt ~]# kubectl logs nvidia-device-plugin-daemonset-br5xf -n gpu-operator-resources
2021/11/29 20:40:05 Loading NVML
2021/11/29 20:40:05 Starting FS watcher.
2021/11/29 20:40:05 Starting OS watcher.
2021/11/29 20:40:05 Retreiving plugins.
2021/11/29 20:40:05 Shutdown of NVML returned:
panic: More than one MIG device type present on node

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae1200, 0x1042638)
        /go/src/nvidia-device-plugin/mig-strategy.go:124 +0x7cb
main.start(0xc4202f8e80, 0x0, 0x0)
        /go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc4202cef00, 0xae5a80, 0xc4202f0010, 0xc4202e4190, 0x1, 0x1, 0x0, 0x0)
        /go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc4202cef00, 0xc4202e4190, 0x1, 0x1, 0x456810, 0xc420363f50)
        /go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
        /go/src/nvidia-device-plugin/main.go:88 +0x751
```

shivamerla commented 2 years ago

@sricharandevops Please edit the clusterpolicy (`kubectl edit clusterpolicy`) and change mig.strategy=mixed; you seem to have different MIG partitions set up across devices. By the way, for MIG configuration please use the MIG Manager functionality: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-operator-mig.html#gpu-operator-with-mig
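
If editing interactively is inconvenient, the same change can be applied as a patch along these lines (assuming the ClusterPolicy resource created by the chart is named cluster-policy; verify with `kubectl get clusterpolicy`):

```sh
# Assumption: the ClusterPolicy instance is named "cluster-policy"
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec": {"mig": {"strategy": "mixed"}}}'
```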

Support for MIG Manager with pre-installed drivers will be added in v1.9.0 (you can try out the v1.9.0-beta helm charts, which are already published).

sricharandevops commented 2 years ago

@shivamerla ,

Do we have support for CentOS with 1.9.0? I am not able to install the 1.9 version. Please suggest.

(screenshot)

shivamerla commented 2 years ago

@sricharandevops please delete the old CRD (`kubectl delete crd clusterpolicies.nvidia.com`) and try again. Also pass `--set mig.strategy=mixed` to match your system config.
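
Put together, the retry might look like the sketch below (flags collected from earlier in this thread; whether the toolkit version override is still needed with the 1.9.0 charts is an assumption to verify):

```sh
# Remove the old CRD, then reinstall with the settings discussed above
kubectl delete crd clusterpolicies.nvidia.com
helm install gpu-operator nvidia/gpu-operator \
  -n robinio \
  --set driver.enabled=false \
  --set toolkit.version=1.7.2-centos7 \
  --set mig.strategy=mixed
```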

sricharandevops commented 2 years ago

I tried to install using 1.9-beta.

(screenshot)

```
kubectl describe pod gpu-feature-discovery-hhcqw -n robinio
..
..
Events:
  Type     Reason                  Age    From               Message
  ----     ------                  ----   ----               -------
  Normal   Scheduled               4m37s  default-scheduler  Successfully assigned robinio/gpu-feature-discovery-hhcqw to hgxlearn1000-mgmt.localdomain
  Warning  FailedCreatePodSandBox  4m36s  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = Exception calling application: ErrorUnknown:StatusCode.UNKNOWN:RuntimeHandler "nvidia" not supported
```


shivamerla commented 2 years ago

I see the container-toolkit pod is still initializing. After the toolkit is set up, the rest of the pods should start up.

sricharandevops commented 2 years ago

Just curious, I have waited for more than 30 minutes. Does it usually take this long to bring up the toolkit?

shivamerla commented 2 years ago

No, it will be quick after the image is pulled. Please share `kubectl logs <toolkit-pod> -n robinio -c driver-validation` and `kubectl describe pod <toolkit-pod> -n robinio`.

shivamerla commented 2 years ago

@sricharandevops were you able to get this working?

sricharandevops commented 2 years ago

> @sricharandevops were you able to get this working?

Sorry for the delayed response!

I was able to get things working with the 1.8 GPU Operator, with the NVIDIA driver and the fabric-manager service running on the host and the rest of the GPU Operator components running inside containers.

sricharanrobinsystems commented 2 years ago

@shivamerla, can we have HGX A100 (fabric manager) and other GPU servers in the same Kubernetes cluster? In that case, how can I selectively have some nodes take the driver from the host and others from the nvidia-driver-daemonset?