NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

nvidia-cuda-validator pods crashlooping in OKD4.7 #259

Open william0212 opened 3 years ago

william0212 commented 3 years ago

1. Quick Debug Checklist

1. Issue or feature description

I deployed the gpu-operator in an OKD (4.7.0) cluster, but the nvidia-cuda-validator pods are crashlooping all the time, like in issue #253.

2. Steps to reproduce the issue

1) Installed the NVIDIA driver (470.57.02) and CUDA (11.4.1) directly on the GPU machine running Fedora CoreOS, not in a container.
2) Helm installed the gpu-operator (1.8.1) with the --set driver.enabled=false parameter in the cluster (see the sketch after this list).
3) Mirrored all of the required images to a local repository and changed values.yaml to pull from there.
4) In the gpu-operator namespace, the one pod is running normally. But in the gpu-operator-resources namespace, 5 pods run fine while the nvidia-cuda-validator init container crashes every time with this log:

Failed to allocate device vector A (error code no CUDA-capable device is detected)! [Vector addition of 50000 elements]

At the same time, the nvidia-operator-validator pod is blocked at Init:2/4, waiting for it to complete. The strange thing I find is that it does not download the cuda:11.4.1-base-ubi8 image, so I guess it is an SCC problem or something like that? Or is it related to CUDA being installed directly on the machine? Please help me with this issue, thanks.
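
A minimal sketch of the install command described in step 2, assuming the chart is installed from a local checkout and the local-registry overrides live in an edited values.yaml (the file name and namespace here are illustrative, not from the original report):

# Hypothetical reconstruction of the install from step 2; paths and names are placeholders.
helm install --wait --generate-name ./gpu-operator \
  --namespace gpu-operator \
  --set operator.defaultRuntime=crio \
  --set driver.enabled=false \
  -f values.yaml    # edited to pull all images from the local repository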

shivamerla commented 3 years ago

@william0212 Can you share the output of nvidia-smi run from the driver pod or any of the plugin/GFD pods? Is the GPU A100 80GB? Also can you share server model and output of lspci -vvv -d 10de: -xxx
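
For reference, a sketch of how that information can be collected when the driver is installed on the host rather than in a driver pod (the pod name below is a placeholder and would come from oc get pods -n gpu-operator-resources):

# Run nvidia-smi from one of the GPU operand pods, e.g. the device plugin pod (name is a placeholder)
oc exec -n gpu-operator-resources nvidia-device-plugin-daemonset-xxxxx -- nvidia-smi

# Collect the PCI details for the NVIDIA devices directly on the GPU node
oc debug node/worker200.okd.med.thu -- chroot /host lspci -vvv -d 10de: -xxx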

william0212 commented 3 years ago

My GPU is a V100 32G. There is no driver pod, because I installed the driver directly on the host and set --set driver.enabled=false when deploying the GPU operator. The log below is from the driver-validation init container of the nvidia-operator-validator pod:

running command chroot with args [/run/nvidia/driver nvidia-smi]
Thu Sep 16 01:12:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The gpu-feature-discovery pod is just waiting, like this:

gpu-feature-discovery: 2021/09/16 01:12:55 Running gpu-feature-discovery in version v0.4.1
gpu-feature-discovery: 2021/09/16 01:12:55 Loaded configuration:
gpu-feature-discovery: 2021/09/16 01:12:55 Oneshot: false
gpu-feature-discovery: 2021/09/16 01:12:55 FailOnInitError: true
gpu-feature-discovery: 2021/09/16 01:12:55 SleepInterval: 1m0s
gpu-feature-discovery: 2021/09/16 01:12:55 MigStrategy: single
gpu-feature-discovery: 2021/09/16 01:12:55 NoTimestamp: false
gpu-feature-discovery: 2021/09/16 01:12:55 OutputFilePath: /etc/kubernetes/node-feature-discovery/features.d/gfd
gpu-feature-discovery: 2021/09/16 01:12:55 Start running
gpu-feature-discovery: 2021/09/16 01:12:55 Writing labels to output file
gpu-feature-discovery: 2021/09/16 01:12:55 Sleeping for 1m0s

My server is a Dell machine running Fedora CoreOS as the base OS for the OKD platform. The lspci command you asked for shows:

[root@worker200 core]# lspci -vvv -d 10de: -xxx
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
        Subsystem: NVIDIA Corporation Device 124a
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 156
        NUMA node: 0
        Region 0: Memory at ab000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 382000000000 (64-bit, prefetchable) [size=32G]
        Region 3: Memory at 382800000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00078  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [258 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [ac0 v1] Designated Vendor-Specific: Vendor=10de ID=0001 Rev=1 Len=12 <?>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
00: de 10 b6 1d 07 04 10 00 a1 00 02 03 00 00 00 00
10: 00 00 00 ab 0c 00 00 00 20 38 00 00 0c 00 00 00
20: 28 38 00 00 00 00 00 00 00 00 00 00 de 10 4a 12
30: 00 00 00 00 60 00 00 00 00 00 00 00 0b 01 00 00
40: de 10 4a 12 00 00 00 00 00 00 00 00 00 00 00 00
50: 03 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 01 68 03 00 08 00 00 00 05 78 81 00 78 00 e0 fe
70: 00 00 00 00 00 00 00 00 10 00 02 00 e1 8d 2c 01
80: 3e 21 00 00 03 41 45 00 40 01 03 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 13 00 04 00
a0: 06 00 00 00 0e 00 00 00 03 00 1f 00 00 00 00 00
b0: 00 00 00 00 09 00 14 01 00 00 10 80 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

william0212 commented 3 years ago

Another piece of information I want to share about Node Feature Discovery: I installed version 4.8.0 from Red Hat via the OKD OperatorHub, and today I find that all of the nfd-worker pods in the openshift-operators namespace are in CrashLoopBackOff with the log below:

1 nfd-worker.go:186] Node Feature Discovery Worker 1.16
I0916 01:06:26.742837       1 nfd-worker.go:187] NodeName: 'worker200.okd.med.thu'
I0916 01:06:26.743197       1 nfd-worker.go:422] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0916 01:06:26.743224       1 nfd-worker.go:457] worker (re-)configuration successfully completed
I0916 01:06:26.743253       1 nfd-worker.go:316] connecting to nfd-master at nfd-master:12000 ...
I0916 01:06:26.743271       1 clientconn.go:245] parsed scheme: ""
I0916 01:06:26.743281       1 clientconn.go:251] scheme "" not registered, fallback to default scheme
I0916 01:06:26.743307       1 resolver_conn_wrapper.go:172] ccResolverWrapper: sending update to cc: {[{nfd-master:12000 0 }] }
I0916 01:06:26.743315       1 clientconn.go:674] ClientConn switching balancer to "pick_first"
I0916 01:06:26.747659       1 nfd-worker.go:468] starting feature discovery...
I0916 01:06:26.784109       1 nfd-worker.go:480] feature discovery completed
I0916 01:06:26.784132       1 nfd-worker.go:550] sending labeling request to nfd-master
E0916 01:06:26.788670       1 nfd-worker.go:557] failed to set node labels: rpc error: code = Unknown desc = nodes "worker200.okd.med.thu" is forbidden: User "system:serviceaccount:openshift-nfd:nfd-master" cannot get resource "nodes" in API group "" at the cluster scope
I0916 01:06:26.788711       1 nfd-worker.go:330] closing connection to nfd-master ...
F0916 01:06:26.788732       1 main.go:63] failed to advertise labels: rpc error: code = Unknown desc = nodes "worker200.okd.med.thu" is forbidden: User "system:serviceaccount:openshift-nfd:nfd-master" cannot get resource "nodes" in API group "" at the cluster scope

Is this the reason for the problem, and how do I fix it? Thanks again for your help.
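
The forbidden error above is an RBAC failure: the nfd-master service account has no cluster-scope permission on nodes, so it cannot apply the labels the workers send it. A minimal sketch of a workaround, assuming the service account really is openshift-nfd:nfd-master as the log reports (the role names below are illustrative; the NFD operator normally ships its own ClusterRole, so repairing or reinstalling that is the cleaner fix):

# Grant the nfd-master service account the node permissions the log says are missing
oc create clusterrole nfd-master-nodes --verb=get,list,update,patch --resource=nodes
oc create clusterrolebinding nfd-master-nodes \
  --clusterrole=nfd-master-nodes \
  --serviceaccount=openshift-nfd:nfd-master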

william0212 commented 3 years ago

Today I uninstalled the NFD operator from Red Hat and installed the official NFD (v0.9.0). All of its pods are running. But after that, I ran this command:

helm install --wait --generate-name \
  ./gpu-operator \
  --set nfd.enabled=false \            (because I have deployed NFD above)
  --set operator.defaultRuntime=crio \
  --set driver.enabled=false           (because I have installed the driver on the local machine)

The result is the same. It does not download the cuda image 11.4.1-base-ubi8. I will show you the YAML of nvidia-cuda-validator:

kind: Pod
apiVersion: v1
metadata:
  generateName: nvidia-cuda-validator-
  annotations:
    k8s.ovn.org/pod-networks: >-
      {"default":{"ip_addresses":["10.143.0.189/23"],"mac_address":"0a:58:0a:8f:00:bd","gateway_ips":["10.143.0.1"],"ip_address":"10.143.0.189/23","gateway_ip":"10.143.0.1"}}
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.143.0.189"
          ],
          "mac": "0a:58:0a:8f:00:bd",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.143.0.189"
          ],
          "mac": "0a:58:0a:8f:00:bd",
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: restricted
  selfLink: /api/v1/namespaces/gpu-operator-resources/pods/nvidia-cuda-validator-wvcbh
  resourceVersion: '10539813'
  name: nvidia-cuda-validator-wvcbh
  uid: c852a397-37b3-45aa-8c1a-4a3874a65098
  creationTimestamp: '2021-09-16T12:47:59Z'
  managedFields:
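
Given the openshift.io/scc: restricted annotation on this pod, a hedged way to check whether SCC assignment is part of the problem is to compare the SCC this validator pod was admitted under with the SCC of an operand pod that works (the pod names below are taken from this thread and may differ on another cluster):

# SCC the failing cuda validator pod was admitted with
oc get pod nvidia-cuda-validator-wvcbh -n gpu-operator-resources -o yaml | grep 'openshift.io/scc'

# SCC of a working operand pod, for comparison (name is a placeholder)
oc get pod nvidia-container-toolkit-daemonset-nv6jk -n gpu-operator-resources -o yaml | grep 'openshift.io/scc'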

shivamerla commented 2 years ago

helm install --wait --generate-name ./gpu-operator \
  --set nfd.enabled=false \            (because I have deployed above)
  --set operator.defaultRuntime=crio \
  --set driver.enabled=false           (because I have install on the local machine)

For a Helm install on OCP you have to override the toolkit/dcgm images as well.

helm install gpu-operator nvidia/gpu-operator --version=1.8.2 --set platform.openshift=true,operator.defaultRuntime=crio,nfd.enabled=false,toolkit.version=1.7.1-ubi8,dcgmExporter.version=2.2.9-2.4.0-ubi8,dcgm.version=2.2.3-ubi8,migManager.version=v0.1.3-ubi8
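
After the install, a quick way to confirm that the overridden versions actually landed (a sketch, assuming the default ClusterPolicy created by the chart and the gpu-operator-resources namespace used elsewhere in this thread):

# Image versions recorded in the ClusterPolicy created by the chart
oc get clusterpolicy -o yaml | grep -E 'repository|image|version'

# Images the operand pods are actually running
oc get pods -n gpu-operator-resources \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
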
william0212 commented 2 years ago

@shivamerla I followed your instructions. The result is the same as before: the nvidia-cuda-validator init container errors out, and it did not download the cuda image. Please help me find what controls downloading the cuda image; I think that is the problem. Or is there some configuration problem with node-feature-discovery or gpu-feature-discovery? This is the log from nvidia-container-toolkit, where there are some errors: nvidia-container-toolkit-daemonset-nv6jk-nvidia-container-toolkit-ctr.log

shivamerla commented 2 years ago

@william0212 the cuda-validator pod doesn't download cuda images; the vectorAdd sample is built into the gpu-operator-validator image and is invoked at runtime. Wondering if the cuda 11.4.1 package installed directly on the host is causing any of this.
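
In other words, the vectorAdd binary ships inside the gpu-operator-validator image, so no cuda:11.4.1-base-ubi8 pull is expected. A hedged way to confirm which image the validator actually runs and to read its failure output (the pod name is a placeholder, and the init container name cuda-validation is an assumption):

# Image used by the cuda validator's init container
oc get pod nvidia-cuda-validator-wvcbh -n gpu-operator-resources \
  -o jsonpath='{.spec.initContainers[*].image}{"\n"}'

# vectorAdd output from that init container (container name assumed)
oc logs nvidia-cuda-validator-wvcbh -n gpu-operator-resources -c cuda-validation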

We should see toolkit logs on the host after adding the debug fields as below.

$ cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
disable-require = false

[nvidia-container-cli]
  debug = "/var/log/nvidia-container-cli.log"
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  debug = "/var/log/nvidia-container-runtime.log"

[core@ocp-mgmt-host ~]$ 
[core@ocp-mgmt-host ~]$ oc get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-tm7nr                1/1     Running     2          6d20h
nvidia-container-toolkit-daemonset-xprxd   1/1     Running     0          6d20h
nvidia-cuda-validator-5xgst                0/1     Completed   0          6d20h
nvidia-dcgm-exporter-v29mn                 1/1     Running     0          6d20h
nvidia-dcgm-q5lz7                          1/1     Running     1          6d20h
nvidia-device-plugin-daemonset-92q8r       1/1     Running     1          6d20h
nvidia-device-plugin-validator-5lk29       0/1     Completed   0          6d20h
nvidia-driver-daemonset-p4cvr              1/1     Running     0          6d20h
nvidia-node-status-exporter-jc6zz          1/1     Running     0          6d20h
nvidia-operator-validator-xgmtj            1/1     Running     0          6d20h

[core@ocp-mgmt-host ~]$ oc delete pod nvidia-operator-validator-xgmtj -n gpu-operator-resources
pod "nvidia-operator-validator-xgmtj" deleted

[core@ocp-mgmt-host ~]$ ls -ltr /var/log/nvidia-container*
-rw-r--r--. 1 root root 154810 Oct  4 20:42 /var/log/nvidia-container-cli.log
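
With those debug fields in place, re-triggering validation as shown above should write the logs on the GPU node itself; a sketch of pulling errors out of them (the paths match the config.toml shown earlier):

# Check the container CLI and runtime debug logs on the GPU node for errors
grep -i error /var/log/nvidia-container-cli.log
grep -i error /var/log/nvidia-container-runtime.log

# Or follow the CLI log while the validator pod is recreated
tail -f /var/log/nvidia-container-cli.log
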
khanof commented 2 years ago

In one project we faced the same issue. To fix it, try uninstalling the NVIDIA driver from the node, leave driver.enabled=true, and choose the right driver version (not every NVIDIA driver has a corresponding driver image); setting driver.enabled=true lets the GPU Operator install the driver and CUDA itself. I also think that when we set driver.enabled=false, we should turn the CUDA validator off as well.
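
A sketch of that suggestion (the flags and driver version below are illustrative, not a confirmed fix): remove the host driver and CUDA packages, then reinstall so the operator manages the driver container itself.

# Reinstall with the driver container enabled; driver.enabled defaults to true
helm install gpu-operator nvidia/gpu-operator --version=1.8.2 \
  --set platform.openshift=true \
  --set operator.defaultRuntime=crio \
  --set nfd.enabled=false \
  --set driver.enabled=true \
  --set driver.version=470.57.02   # choose a version that has a published driver image for the node OS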

Muscule commented 2 years ago

I have 3 nodes, with a Tesla T4, an A100, and an A30. On the Tesla T4 node, nvidia-cuda-validator completes successfully, but on the A100 and A30 nodes nvidia-cuda-validator keeps crashlooping. "[Vector addition of 50000 elements] Failed to allocate vector A (error code initialization error)!" is in the cuda-validator container's log. Is there any way to fix this?