Nvidia-gpu-tool shows GPUs are broken

vishnukumarkalidasan commented 5 months ago

System background: one A5000 plus two H100 GPUs on an AMD EPYC 7643 48-Core Processor host.

root@tiger:/shared/nvtrust/host_tools/python/gpu-admin-tools# python3 nvidia_gpu_tools.py
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['nvidia_gpu_tools.py']
  File "/shared/nvtrust/host_tools/python/gpu-admin-tools/nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "/shared/nvtrust/host_tools/python/gpu-admin-tools/nvidia_gpu_tools.py", line 3671, in __init__
    raise BrokenGpuError()
2024-04-04,15:32:27.719 ERROR    GPU /sys/bus/pci/devices/0000:01:00.0 broken:
2024-04-04,15:32:27.720 ERROR    Config space working True
  File "/shared/nvtrust/host_tools/python/gpu-admin-tools/nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "/shared/nvtrust/host_tools/python/gpu-admin-tools/nvidia_gpu_tools.py", line 3671, in __init__
    raise BrokenGpuError()
2024-04-04,15:32:27.726 ERROR    GPU /sys/bus/pci/devices/0000:41:00.0 broken:
2024-04-04,15:32:27.728 ERROR    Config space working True
  File "/shared/nvtrust/host_tools/python/gpu-admin-tools/nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "/shared/nvtrust/host_tools/python/gpu-admin-tools/nvidia_gpu_tools.py", line 3671, in __init__
    raise BrokenGpuError()
2024-04-04,15:32:27.734 ERROR    GPU /sys/bus/pci/devices/0000:61:00.0 broken:
2024-04-04,15:32:27.735 ERROR    Config space working True
GPUs:
  0 GPU 0000:01:00.0 [broken, cfg space working 1 bars configured 1]
  1 GPU 0000:41:00.0 [broken, cfg space working 1 bars configured 1]
  2 GPU 0000:61:00.0 [broken, cfg space working 1 bars configured 1]
Other:
2024-04-04,15:32:27.735 INFO     No GPU specified, select GPU with --gpu, --gpu-bdf, or --gpu-name

lspci -vd 10de: --> output

61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
        Subsystem: NVIDIA Corporation Device 1626
        Flags: fast devsel, IRQ 255, NUMA node 0, IOMMU group 64
        Memory at 1e042000000 (64-bit, prefetchable) [size=16M]
        Memory at 1a000000000 (64-bit, prefetchable) [size=128G]
        Memory at 1e040000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [9c] Vendor Specific Information: Len=14 <?>
        Capabilities: [b0] MSI-X: Enable- Count=9 Masked-
        Capabilities: [100] Secondary PCI Express
        Capabilities: [12c] Latency Tolerance Reporting
        Capabilities: [134] Physical Resizable BAR
        Capabilities: [140] Virtual Resizable BAR
        Capabilities: [14c] Data Link Feature <?>
        Capabilities: [158] Physical Layer 16.0 GT/s <?>
        Capabilities: [188] Extended Capability ID 0x2a
        Capabilities: [1b8] Advanced Error Reporting
        Capabilities: [200] Lane Margining at the Receiver <?>
        Capabilities: [248] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [250] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [2a4] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
        Capabilities: [2b8] Power Budgeting <?>
        Capabilities: [2c8] Extended Capability ID 0x2e
        Capabilities: [2f0] Device Serial Number 4b-49-cd-a7-48-2d-b0-48
        Kernel modules: nvidiafb, nouveau

lscpu --> output

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  192
  On-line CPU(s) list:   0-191
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7643 48-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  48
    Socket(s):           2
    Stepping:            1
    Frequency boost:     disabled
    CPU max MHz:         3640.9170
    CPU min MHz:         1500.0000
    BogoMIPS:            4600.01
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3
                         cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca sme sev sev_es sev_snp
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   3 MiB (96 instances)
  L1i:                   3 MiB (96 instances)
  L2:                    48 MiB (96 instances)
  L3:                    512 MiB (16 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-47,96-143
  NUMA node1 CPU(s):     48-95,144-191
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Not affected

vishnukumarkalidasan commented 5 months ago

Here is the lspci -vvv output of one of the GPUs

61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
        Subsystem: NVIDIA Corporation Device 1626
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 255
        NUMA node: 0
        IOMMU group: 64
        Region 0: Memory at 1e042000000 (64-bit, prefetchable) [size=16M]
        Region 2: Memory at 1a000000000 (64-bit, prefetchable) [size=128G]
        Region 4: Memory at 1e040000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [60] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s (downgraded), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [9c] Vendor Specific Information: Len=14 <?>
        Capabilities: [b0] MSI-X: Enable- Count=9 Masked-
                Vector table: BAR=0 offset=00b90000
                PBA: BAR=0 offset=00ba0000
        Capabilities: [100 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [12c v1] Latency Tolerance Reporting
                Max snoop latency: 1048576ns
                Max no snoop latency: 1048576ns
        Capabilities: [134 v1] Physical Resizable BAR
                BAR 2: current size: 128GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB 128GB
        Capabilities: [140 v1] Virtual Resizable BAR
                BAR 1: current size: 4GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB 128GB 256TB 512TB 1PB 2PB 4PB 8PB 16PB 32PB 64PB 128PB 256PB 512PB 1EB 2EB 4EB 8EB
        Capabilities: [14c v1] Data Link Feature <?>
        Capabilities: [158 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [188 v1] Extended Capability ID 0x2a
        Capabilities: [1b8 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [200 v1] Lane Margining at the Receiver <?>
        Capabilities: [248 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [250 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 2, stride: 1, Device ID: 2331
                Supported Page Size: 00000573, System Page Size: 00000001
                Region 0: Memory at ae200000 (32-bit, non-prefetchable)
                Region 1: Memory at 000001c000000000 (64-bit, prefetchable)
                Region 3: Memory at 000001e000000000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [2a4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
        Capabilities: [2b8 v1] Power Budgeting <?>
        Capabilities: [2c8 v1] Extended Capability ID 0x2e
        Capabilities: [2f0 v1] Device Serial Number 4b-49-cd-a7-48-2d-b0-48
        Kernel modules: nvidiafb, nouveau

vishnukumarkalidasan commented 5 months ago

Update 1:

Enabled CC mode on both H100s using sysfs MMIO access.

root@tiger:/shared/nvtrust/host_tools/python/gpu-admin-tools# python3 nvidia_gpu_tools.py --gpu=2 --mmio-access-type sysfs
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['nvidia_gpu_tools.py', '--gpu=2', '--mmio-access-type', 'sysfs']
2024-04-05,11:44:26.708 WARNING  GPU 0000:41:00.0 ? 0x2331 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
2024-04-05,11:44:26.760 WARNING  GPU 0000:61:00.0 ? 0x2331 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
GPUs:
  0 GPU 0000:01:00.0 A10 0x2231 BAR0 0xf8000000
  1 GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000
  2 GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000
Other:
Topo:
  PCI 0000:00:01.1 0x1022:0x1483
   GPU 0000:01:00.0 A10 0x2231 BAR0 0xf8000000
  PCI 0000:40:01.1 0x1022:0x1483
   GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000
  PCI 0000:60:03.1 0x1022:0x1483
   GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000
2024-04-05,11:44:26.760 INFO     Selected GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000
2024-04-05,11:44:26.760 WARNING  GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000 has CC mode on, some functionality may not work
2024-04-05,11:44:26.761 WARNING  GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000 restoring power control to auto
2024-04-05,11:44:26.761 WARNING  GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000 restoring power control to auto

root@tiger:/shared/nvtrust/host_tools/python/gpu-admin-tools# python3 nvidia_gpu_tools.py --gpu-name=H100 --mmio-access-type sysfs
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['nvidia_gpu_tools.py', '--gpu-name=H100', '--mmio-access-type', 'sysfs']
2024-04-05,11:43:44.596 WARNING  GPU 0000:41:00.0 ? 0x2331 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
2024-04-05,11:43:44.648 WARNING  GPU 0000:61:00.0 ? 0x2331 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
AMD_SEV_DIR=/shared/AMDSEV/snp-release-2023-12-19
Topo:
  PCI 0000:00:01.1 0x1022:0x1483
   GPU 0000:01:00.0 A10 0x2231 BAR0 0xf8000000
  PCI 0000:40:01.1 0x1022:0x1483
   GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000
  PCI 0000:60:03.1 0x1022:0x1483
   GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000
2024-04-05,11:43:44.648 INFO     Selected GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000
2024-04-05,11:43:44.648 WARNING  GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000 has CC mode on, some functionality may not work
2024-04-05,11:43:44.649 WARNING  GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000 restoring power control to auto
2024-04-05,11:43:44.649 WARNING  GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000 restoring power control to auto
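The warnings above suggest that both H100s were in D3 with runtime power control set to auto when the tool probed them. A minimal Python sketch for inspecting that setting from sysfs follows. The `power/control` attribute is the standard PCI runtime power management file; the helper name `read_power_control` is hypothetical, and the BDFs are simply copied from the log above.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci/devices")


def read_power_control(bdf, root=SYSFS_PCI):
    """Return the runtime-PM policy ('auto' or 'on') for a PCI device.

    Returns None when the attribute cannot be read (device absent,
    insufficient permissions, or a non-Linux host).
    """
    path = Path(root) / bdf / "power" / "control"
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # BDFs of the two H100s as reported in this thread.
    for bdf in ("0000:41:00.0", "0000:61:00.0"):
        print(bdf, read_power_control(bdf))
```

A value of `auto` means the kernel may runtime-suspend the device, which would be consistent with the "was in D3" warnings; this is only a diagnostic aid, not something the tool itself requires.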

vishnukumarkalidasan commented 5 months ago

Update 2:

The GPU fails to initialize inside the VM.

Output of dmesg | tail:

[  556.039789] NVRM spdmStart_IMPL: SPDM: Certificate retrieval failed!
[  556.039795] NVRM spdmStart_IMPL: SPDM: Session establishment failed!
[  556.039807] NVRM nvCheckOkFailedNoLog: Check failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from spdmStart(pGpu, pConfCompute->pSpdm) @ conf_compute.c:238
[  556.039812] NVRM RmInitNvDevice: *** Cannot pre-initialize the device
[  556.039814] NVRM RmInitAdapter: RmInitNvDevice failed, bailing out of RmInitAdapter
[  556.039835] NVOC: __nvoc_objDelete: Child class Spdm not freed from parent class ConfidentialCompute.NVOC: __nvoc_objDelete: Child class GenericKernelFalcon not freed from parent class OBJGPU.NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x25:899)
[  556.748085] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  556.850711] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[  556.850734] NVRM osInitNvMapping: *** Cannot attach gpu
[  556.850737] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[  556.850750] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[  556.852912] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  556.961720] nvidia-uvm: Loaded the UVM driver, major device number 238.

nvcc output

root@sev:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

The driver and toolkit were installed via cuda_12.2.1_535.86.10_linux.run.

Tan-YiFan commented 5 months ago

Could you provide the guest boot script? And what version of Linux is the guest running (uname -a in the guest)?

If I got the output SPDM: Certificate retrieval failed!, I would hack the driver, adding some prints to find out what's going wrong.

steven-bellock commented 5 months ago

SPDM: Certificate retrieval failed!

usually means the ecdsa_generic and/or ecdh kernel modules have not been loaded.
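A quick way to check this inside the guest is to look for the modules in `/proc/modules`. The Python sketch below is only illustrative: the helper name `missing_modules` is hypothetical, and the module names `ecdsa_generic` and `ecdh_generic` are the ones that show up in the `lsmod` output later in this thread. Modules built into the kernel do not appear in `/proc/modules`, so an entry reported as missing is a hint rather than proof.

```python
from pathlib import Path

# Module names as they appear in the lsmod output in this thread.
REQUIRED = ("ecdsa_generic", "ecdh_generic")


def missing_modules(proc_modules_text, required=REQUIRED):
    """Return the required module names absent from /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        print("missing:", missing_modules(proc_modules.read_text()))
    else:
        print("/proc/modules not available on this system")
```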

vishnukumarkalidasan commented 5 months ago

Could you provide the guest boot script? And what version of Linux is the guest running (uname -a in the guest)?

If I got the output SPDM: Certificate retrieval failed!, I would hack the driver, adding some prints to find out what's going wrong.

root@sev:~# uname -a
Linux sev 6.5.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

This is the output of uname -a.

vishnukumarkalidasan commented 5 months ago

SPDM: Certificate retrieval failed!

usually means the ecdsa_generic and/or ecdh kernel modules have not been loaded.

Okay, so I ran modprobe for both ecdsa_generic and ecdh_generic and then rescanned the PCI bus. Now I can see the GPU in nvidia-smi.

The usage count for both modules is 0. Here is the output:

Module                  Size  Used by
ecdh_generic           16384  0
ecdsa_generic          16384  0
ecc                    45056  3 ecdh_generic,ecdsa_generic,nvidia

But when I ran nvidia-smi a second time, it failed with a page fault and a kernel crash.

[  886.434042] BUG: unable to handle page fault for address: 00000000000152ba
[  886.435780] #PF: supervisor read access in kernel mode
[  886.436748] #PF: error_code(0x0000) - not-present page
[  886.438155] PGD 0 P4D 0
[  886.438942] Oops: 0000 [#1] PREEMPT SMP NOPTI 
[  886.439843] CPU: 24 PID: 523 Comm: nv_open_q Tainted: G           OE      6.5.0-26-generic #26~22.04.1-Ubuntu
[  886.442057] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[  886.443964] RIP: 0010:freeRpcInfrastructure_VGPU+0x25/0xe0 [nvidia]
[  886.445151] Code: ff eb 85 66 90 f3 0f 1e fa 55 48 89 e5 41 55 41 bd 40 00 00 00 41 54 53 48 83 ec 08 8b 87 e4 04 00 00 48 8b 1c c5 00 9a 8d c0 <80> bb ba 52 01 00 00 74 5f 45 31 ed 80 bf 83 02 00 00 00 49 89 fc
[  886.449557] RSP: 0018:ffffa750811eba60 EFLAGS: 00010286
[  886.451053] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
[  886.452405] RDX: 0000000000000001 RSI: ffffa750811eba54 RDI: ffff965f868c0020
[  886.453594] RBP: ffffa750811eba80 R08: ffffffffc0668208 R09: 0000000000000000
[  886.455593] R10: ffffa750811eb8a8 R11: 0000000000000000 R12: 0000000000000000
[  886.456994] R13: 0000000000000040 R14: 0000000000000000 R15: 00000000180000a1
[  886.458996] FS:  0000000000000000(0000) GS:ffff966ebf400000(0000) knlGS:0000000000000000
[  886.460546] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  886.461485] CR2: 00000000000152ba CR3: 00080009fda3a000 CR4: 00000000003506e0
[  886.463525] Call Trace:
[  886.463981]  <TASK>
[  886.464340]  ? show_regs+0x6d/0x80
[  886.465294]  ? __die+0x24/0x80
[  886.466147]  ? page_fault_oops+0x99/0x1b0
[  886.467322]  ? do_user_addr_fault+0x31d/0x6b0
[  886.468305]  ? exc_page_fault+0x83/0x1b0
[  886.469205]  ? asm_exc_page_fault+0x27/0x30
[  886.470151]  ? vgpuDestructObject+0xf8/0x110 [nvidia]
[  886.471584]  ? freeRpcInfrastructure_VGPU+0x25/0xe0 [nvidia]
[  886.472787]  vgpuDestructObject+0x5b/0x110 [nvidia]
[  886.474083]  gpuDestruct_IMPL+0x377/0x3e0 [nvidia]
[  886.475463]  __nvoc_dtor_OBJGPU+0x15/0x40 [nvidia]
[  886.476732]  __nvoc_objDelete+0x2c/0xf0 [nvidia]
[  886.477958]  gpumgrAttachGpu+0x90a/0xea0 [nvidia]
[  886.479218]  RmInitAdapter+0x5ad/0x19b0 [nvidia]
[  886.479979]  ? srso_return_thunk+0x5/0x10
[  886.480510]  ? srso_return_thunk+0x5/0x10
[  886.481050]  ? srso_return_thunk+0x5/0x10
[  886.481584]  ? _raw_spin_lock_irqsave+0xe/0x20
[  886.482166]  ? srso_return_thunk+0x5/0x10
[  886.482595]  rm_init_adapter+0xad/0xc0 [nvidia]
[  886.483322]  nv_open_device+0x42b/0xa20 [nvidia]
[  886.484022]  nvidia_open_deferred+0x39/0xb0 [nvidia]
[  886.484757]  _main_loop+0x82/0x140 [nvidia]
[  886.485377]  ? __pfx__main_loop+0x10/0x10 [nvidia]
[  886.486085]  kthread+0xf2/0x120
[  886.486522]  ? __pfx_kthread+0x10/0x10
[  886.487020]  ret_from_fork+0x47/0x70
[  886.487506]  ? __pfx_kthread+0x10/0x10
[  886.488013]  ret_from_fork_asm+0x1b/0x30
[  886.488537]  </TASK>
[  886.488829] Modules linked in: ecdh_generic ecdsa_generic binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common ppdev input_leds parport_pc sev_guest serio_raw parport nvidia_uvm(OE) mac_hid qemu_fw_cfg sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops msr reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) bochs crct10dif_pclmul drm_vram_helper crc32_pclmul polyval_clmulni drm_ttm_helper polyval_generic ghash_clmulni_intel sha256_ssse3 ttm sha1_ssse3 aesni_intel drm_kms_helper crypto_simd cryptd video ahci i2c_i801 psmouse wmi drm libahci lpc_ich i2c_smbus ecc
[  886.497991] CR2: 00000000000152ba
[  886.498433] ---[ end trace 0000000000000000 ]---
[  886.507366] RIP: 0010:freeRpcInfrastructure_VGPU+0x25/0xe0 [nvidia]
[  886.508344] Code: ff eb 85 66 90 f3 0f 1e fa 55 48 89 e5 41 55 41 bd 40 00 00 00 41 54 53 48 83 ec 08 8b 87 e4 04 00 00 48 8b 1c c5 00 9a 8d c0 <80> bb ba 52 01 00 00 74 5f 45 31 ed 80 bf 83 02 00 00 00 49 89 fc
[  886.510762] RSP: 0018:ffffa750811eba60 EFLAGS: 00010286
[  886.511469] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
[  886.512428] RDX: 0000000000000001 RSI: ffffa750811eba54 RDI: ffff965f868c0020
[  886.513336] RBP: ffffa750811eba80 R08: ffffffffc0668208 R09: 0000000000000000
[  886.514291] R10: ffffa750811eb8a8 R11: 0000000000000000 R12: 0000000000000000
[  886.515249] R13: 0000000000000040 R14: 0000000000000000 R15: 00000000180000a1
[  886.516193] FS:  0000000000000000(0000) GS:ffff966ebf400000(0000) knlGS:0000000000000000
[  886.517244] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  886.517978] CR2: 00000000000152ba CR3: 0008000128026000 CR4: 00000000003506e0
[  886.518911] note: nv_open_q[523] exited with irqs disabled
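When triaging an oops like the one above, the faulting symbol on the RIP line is usually the first thing worth extracting. The Python sketch below pulls it out of saved dmesg text. The helper name `faulting_symbols` is hypothetical and the regular expression is modelled only on the RIP lines in this log, so it may need adjusting for other kernels.

```python
import re

# Matches e.g. "RIP: 0010:freeRpcInfrastructure_VGPU+0x25/0xe0 [nvidia]"
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def faulting_symbols(dmesg_text):
    """Return unique (symbol, module) pairs from RIP lines, in order seen.

    module is None when the RIP line carries no "[module]" suffix.
    """
    seen = []
    for match in RIP_RE.finditer(dmesg_text):
        pair = (match.group("symbol"), match.group("module"))
        if pair not in seen:
            seen.append(pair)
    return seen
```

Applied to the trace above, this reports `freeRpcInfrastructure_VGPU` in the `nvidia` module once, even though the RIP line is printed twice.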

Tan-YiFan commented 5 months ago

Have you created the file /etc/modprobe.d/nvidia-lkca.conf as https://docs.nvidia.com/confidential-computing-deployment-guide.pdf says? (Section Enabling LKCA on the Guest VM)

vishnukumarkalidasan commented 5 months ago

I just tried that, but I see the same pattern: nvidia-smi works the first time, and then the kernel crashes with a page fault as in my previous comment.

Should I downgrade the driver? The current setup is Driver Version: 550.54.14, CUDA Version: 12.4, and I used the 550 server-open-kernel variant.

Tan-YiFan commented 5 months ago

I suggest using 535.86.10.

vishnukumarkalidasan commented 5 months ago

I have now installed the driver version you mentioned. Driver Version: 535.86.10 CUDA Version: 12.2

This time the kernel does not crash, but nvidia-smi still fails the second time it is run.

[   10.543945] nvidia: loading out-of-tree module taints kernel.
[   10.543960] nvidia: module license 'NVIDIA' taints kernel.
[   10.543962] Disabling lock debugging due to kernel taint
[   10.543966] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   10.543967] nvidia: module license taints kernel.
[   10.581561] EXT4-fs (sda2): mounted filesystem 36c949ba-15e8-49bc-b19f-437c7b695fa0 r/w with ordered data mode. Quota mode: none.
[   10.641258] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[   10.641268] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.86.10  Wed Jul 26 23:20:03 UTC 2023
[   10.665060] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.10  Wed Jul 26 23:01:50 UTC 2023 
<skipping few logs>
[  147.672840] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:762)
[  147.675391] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0 
[  147.757253] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:762)
[  147.759663] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

vishnukumarkalidasan commented 5 months ago

Okay, I think it's fine now. I installed the driver again with -m=kernel-open, then enabled persistence mode after reboot. With that, the driver no longer fails.
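To confirm which kernel module flavor is actually loaded after a reinstall with -m=kernel-open, one option is to inspect the driver banner. The Python sketch below is a heuristic under a stated assumption: that the open flavor's banner in `/proc/driver/nvidia/version` contains the phrase "Open Kernel Module", whereas the proprietary banner in the 535.86.10 log earlier in this thread reads "NVIDIA UNIX x86_64 Kernel Module". The helper name `is_open_kernel_module` is hypothetical.

```python
from pathlib import Path


def is_open_kernel_module(version_text):
    """Heuristic check of the driver banner.

    Assumption: the open flavor's banner contains 'Open Kernel Module'.
    """
    return "Open Kernel Module" in version_text


if __name__ == "__main__":
    version_file = Path("/proc/driver/nvidia/version")
    if version_file.exists():
        print("open kernel module:", is_open_kernel_module(version_file.read_text()))
    else:
        print("NVIDIA driver not loaded")
```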

NVIDIA / nvtrust

Nvidia-gpu-tool shows GPUs are broken #51