william0212 opened this issue 3 years ago
@william0212 Can you share the output of nvidia-smi run from the driver pod or any of the plugin/GFD pods? Is the GPU an A100 80GB? Also, can you share the server model and the output of lspci -vvv -d 10de: -xxx?
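For reference, one way to collect that information (the exec target below is a placeholder; any of the driver/plugin/GFD pods listed by the first command should work):
$ oc get pods -n gpu-operator-resources
$ oc exec -n gpu-operator-resources <driver-or-plugin-pod> -- nvidia-smi
$ lspci -vvv -d 10de: -xxx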
My GPU is a V100 32GB. There is no driver pod, because I installed the driver directly on the host and set --set driver.enabled=false when deploying the GPU operator. The log below is from the driver-validation init container of nvidia-operator-validator:
running command chroot with args [/run/nvidia/driver nvidia-smi]
Thu Sep 16 01:12:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The gpu-feature-discovery (GFD) pod is just waiting, like this:
gpu-feature-discovery: 2021/09/16 01:12:55 Running gpu-feature-discovery in version v0.4.1
gpu-feature-discovery: 2021/09/16 01:12:55 Loaded configuration:
gpu-feature-discovery: 2021/09/16 01:12:55 Oneshot: false
gpu-feature-discovery: 2021/09/16 01:12:55 FailOnInitError: true
gpu-feature-discovery: 2021/09/16 01:12:55 SleepInterval: 1m0s
gpu-feature-discovery: 2021/09/16 01:12:55 MigStrategy: single
gpu-feature-discovery: 2021/09/16 01:12:55 NoTimestamp: false
gpu-feature-discovery: 2021/09/16 01:12:55 OutputFilePath: /etc/kubernetes/node-feature-discovery/features.d/gfd
gpu-feature-discovery: 2021/09/16 01:12:55 Start running
gpu-feature-discovery: 2021/09/16 01:12:55 Writing labels to output file
gpu-feature-discovery: 2021/09/16 01:12:55 Sleeping for 1m0s
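A quick way to confirm GFD actually applied its labels to the node (node name as used elsewhere in this thread):
$ oc describe node worker200.okd.med.thu | grep nvidia.com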
My server is a Dell machine running Fedora CoreOS as the base OS for the OKD platform. The lspci command you asked for shows:
[root@worker200 core]# lspci -vvv -d 10de: -xxx
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
Subsystem: NVIDIA Corporation Device 124a
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
One more piece of information about Node Feature Discovery: I installed version 4.8.0 from Red Hat via the OKD OperatorHub, and today I found that all the nfd-worker pods in the openshift-operators namespace are in CrashLoopBackOff with the log below:
1 nfd-worker.go:186] Node Feature Discovery Worker 1.16
I0916 01:06:26.742837 1 nfd-worker.go:187] NodeName: 'worker200.okd.med.thu'
I0916 01:06:26.743197 1 nfd-worker.go:422] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0916 01:06:26.743224 1 nfd-worker.go:457] worker (re-)configuration successfully completed
I0916 01:06:26.743253 1 nfd-worker.go:316] connecting to nfd-master at nfd-master:12000 ...
I0916 01:06:26.743271 1 clientconn.go:245] parsed scheme: ""
I0916 01:06:26.743281 1 clientconn.go:251] scheme "" not registered, fallback to default scheme
I0916 01:06:26.743307 1 resolver_conn_wrapper.go:172] ccResolverWrapper: sending update to cc: {[{nfd-master:12000
Today I uninstalled the Red Hat NFD operator and installed the official NFD (v0.9.0). All of its pods are running.
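A sanity check that the new NFD is labeling the GPU node with the NVIDIA PCI vendor label (10de) that the GPU operator uses to pick GPU nodes (node name as above):
$ oc get node worker200.okd.med.thu --show-labels | tr ',' '\n' | grep pci-10de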
But after that, I used the command:
helm install --wait --generate-name \
  ./gpu-operator \
  --set nfd.enabled=false \
  --set operator.defaultRuntime=crio \
  --set driver.enabled=false
(nfd.enabled=false because NFD is already deployed as described above; driver.enabled=false because the driver is installed on the host machine)
The result is the same: it does not download the cuda image 11.4.1-base-ubi8. I will show you the YAML of nvidia-cuda-validator:
kind: Pod
apiVersion: v1
metadata:
generateName: nvidia-cuda-validator-
annotations:
k8s.ovn.org/pod-networks: >-
{"default":{"ip_addresses":["10.143.0.189/23"],"mac_address":"0a:58:0a:8f:00:bd","gateway_ips":["10.143.0.1"],"ip_address":"10.143.0.189/23","gateway_ip":"10.143.0.1"}}
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "",
"interface": "eth0",
"ips": [
"10.143.0.189"
],
"mac": "0a:58:0a:8f:00:bd",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "",
"interface": "eth0",
"ips": [
"10.143.0.189"
],
"mac": "0a:58:0a:8f:00:bd",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
selfLink: /api/v1/namespaces/gpu-operator-resources/pods/nvidia-cuda-validator-wvcbh
resourceVersion: '10539813'
name: nvidia-cuda-validator-wvcbh
uid: c852a397-37b3-45aa-8c1a-4a3874a65098
creationTimestamp: '2021-09-16T12:47:59Z'
managedFields:
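Rather than the full pod YAML, the failing init container's status and log are probably more telling; something along these lines (the init container name here is a guess; check oc describe for the exact name):
$ oc -n gpu-operator-resources describe pod nvidia-cuda-validator-wvcbh
$ oc -n gpu-operator-resources logs nvidia-cuda-validator-wvcbh -c cuda-validation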
For a Helm install on OCP you have to override the toolkit/dcgm images as well:
helm install gpu-operator nvidia/gpu-operator --version=1.8.2 --set platform.openshift=true,operator.defaultRuntime=crio,nfd.enabled=false,toolkit.version=1.7.1-ubi8,dcgmExporter.version=2.2.9-2.4.0-ubi8,dcgm.version=2.2.3-ubi8,migManager.version=v0.1.3-ubi8
@shivamerla I followed your instructions, but the result is the same as before: the nvidia-cuda-validator init container still errors out, and it didn't download the cuda image. Please help me find what controls downloading the cuda image; I think that is the problem. Or is there some configuration problem with node-feature-discovery or gpu-feature-discovery? This is the log from nvidia-container-toolkit; there are some errors in it: nvidia-container-toolkit-daemonset-nv6jk-nvidia-container-toolkit-ctr.log
@william0212 The cuda-validator pod doesn't download cuda images; we have a vectorAdd sample within the gpu-operator-validator image which gets invoked at runtime. Wondering if the cuda 11.4.1 package installed directly on the host is causing any of this. We should see toolkit logs on the host by adding debug fields as below.
$ cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
disable-require = false
[nvidia-container-cli]
debug = "/var/log/nvidia-container-cli.log"
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
[core@ocp-mgmt-host ~]$
[core@ocp-mgmt-host ~]$ oc get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-tm7nr 1/1 Running 2 6d20h
nvidia-container-toolkit-daemonset-xprxd 1/1 Running 0 6d20h
nvidia-cuda-validator-5xgst 0/1 Completed 0 6d20h
nvidia-dcgm-exporter-v29mn 1/1 Running 0 6d20h
nvidia-dcgm-q5lz7 1/1 Running 1 6d20h
nvidia-device-plugin-daemonset-92q8r 1/1 Running 1 6d20h
nvidia-device-plugin-validator-5lk29 0/1 Completed 0 6d20h
nvidia-driver-daemonset-p4cvr 1/1 Running 0 6d20h
nvidia-node-status-exporter-jc6zz 1/1 Running 0 6d20h
nvidia-operator-validator-xgmtj 1/1 Running 0 6d20h
[core@ocp-mgmt-host ~]$ oc delete pod nvidia-operator-validator-xgmtj -n gpu-operator-resources
pod "nvidia-operator-validator-xgmtj" deleted
[core@ocp-mgmt-host ~]$ ls -ltr /var/log/nvidia-container*
-rw-r--r--. 1 root root 154810 Oct 4 20:42 /var/log/nvidia-container-cli.log
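The next step would presumably be to pull the tail of that log around the failing run, e.g.:
$ sudo tail -n 50 /var/log/nvidia-container-cli.log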
In one of our projects we faced the same issue. To fix it, try uninstalling the NVIDIA driver from the node, leave driver.enabled=true, and choose the right driver version (not every NVIDIA driver version has a corresponding driver image); setting driver.enabled=true lets the GPU Operator install the driver and CUDA itself. I also think that when we set driver.enabled=false, the CUDA validator should be disabled as well.
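A minimal sketch of that suggestion, assuming the chart exposes driver.version as in the 1.8.x values and that the chosen version (470.57.02 here, only as an example) has a published driver image for the node OS:
helm install gpu-operator nvidia/gpu-operator \
  --set operator.defaultRuntime=crio \
  --set driver.enabled=true \
  --set driver.version=470.57.02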
I have 3 nodes with Tesla T4, A100 and A30 GPUs. With the Tesla T4 the nvidia-cuda-validator completed successfully, but with the A100 and A30 the nvidia-cuda-validator keeps crashlooping. "[Vector addition of 50000 elements] Failed to allocate vector A (error code initialization error)!" is in the cuda-validator container's log. Is there any way to fix this?
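Not a confirmed root cause, but on A100/A30 a CUDA "initialization error" can come from MIG mode being left enabled without a matching mig.strategy, so it may be worth checking on those nodes:
$ nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv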
1. Quick Debug Checklist
- Are i2c_core and ipmi_msghandler loaded on the nodes? yes
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? yes
1. Issue or feature description
Deploying the gpu-operator in an OKD (4.7.0) cluster, the nvidia-cuda-validator pod keeps crashlooping all the time, like issue #253.
2. Steps to reproduce the issue
1) Install the NVIDIA driver (470.57.02) and CUDA (11.4.1) directly on the GPU machine running Fedora CoreOS, not in a container.
2) Helm install the gpu-operator (1.8.1) in the cluster with the --set driver.enabled=false parameter.
3) Mirror all the needed images to a local repository and change values.yaml to pull from there.
4) In the gpu-operator namespace the one pod runs normally, but in the gpu-operator-resources namespace 5 pods run OK while the nvidia-cuda-validator init pod crashes all the time with the log below:
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code no CUDA-capable device is detected)!
At the same time, the nvidia-operator-validator init containers block at 2/4, waiting for it to complete. The strange thing I find is that it does not download the cuda:11.4.1-base-ubi8 image, so I guess it is an SCC problem or something like that? Or is it related to CUDA being installed directly on the machine? Please help me with this issue, thanks.
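Since the description above wonders about SCC, one quick check (pod name is the generated one from this cluster; adjust as needed) is which SCC admitted the validator pod and which images its containers reference:
$ oc -n gpu-operator-resources get pod nvidia-cuda-validator-wvcbh -o yaml | grep -E 'openshift.io/scc|image:'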