Closed: vishnukumarkalidasan closed this issue 3 months ago.
Here is the lspci -vvv output of one of the GPUs:
61:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
Subsystem: NVIDIA Corporation Device 1626
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 255
NUMA node: 0
IOMMU group: 64
Region 0: Memory at 1e042000000 (64-bit, prefetchable) [size=16M]
Region 2: Memory at 1a000000000 (64-bit, prefetchable) [size=128G]
Region 4: Memory at 1e040000000 (64-bit, prefetchable) [size=32M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (downgraded), Width x16 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [9c] Vendor Specific Information: Len=14 <?>
Capabilities: [b0] MSI-X: Enable- Count=9 Masked-
Vector table: BAR=0 offset=00b90000
PBA: BAR=0 offset=00ba0000
Capabilities: [100 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [12c v1] Latency Tolerance Reporting
Max snoop latency: 1048576ns
Max no snoop latency: 1048576ns
Capabilities: [134 v1] Physical Resizable BAR
BAR 2: current size: 128GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB 128GB
Capabilities: [140 v1] Virtual Resizable BAR
BAR 1: current size: 4GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB 128GB 256TB 512TB 1PB 2PB 4PB 8PB 16PB 32PB 64PB 128PB 256PB 512PB 1EB 2EB 4EB 8EB
Capabilities: [14c v1] Data Link Feature <?>
Capabilities: [158 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [188 v1] Extended Capability ID 0x2a
Capabilities: [1b8 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [200 v1] Lane Margining at the Receiver <?>
Capabilities: [248 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [250 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
VF offset: 2, stride: 1, Device ID: 2331
Supported Page Size: 00000573, System Page Size: 00000001
Region 0: Memory at ae200000 (32-bit, non-prefetchable)
Region 1: Memory at 000001c000000000 (64-bit, prefetchable)
Region 3: Memory at 000001e000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [2a4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
Capabilities: [2b8 v1] Power Budgeting <?>
Capabilities: [2c8 v1] Extended Capability ID 0x2e
Capabilities: [2f0 v1] Device Serial Number 4b-49-cd-a7-48-2d-b0-48
Kernel modules: nvidiafb, nouveau
Update 1:
Enabled CC on both H100s via sysfs MMIO access:
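The CC-enable command itself isn't shown in the transcript below; with nvtrust's gpu-admin-tools it looks roughly like this (a sketch only -- the exact BDF and flag spellings are assumptions to verify against your checkout's help output):

```shell
# Hypothetical sketch: switch one H100 into Confidential Computing mode
# using nvtrust's gpu-admin-tools with sysfs-based MMIO access.
python3 nvidia_gpu_tools.py --gpu-bdf=61:00.0 --mmio-access-type sysfs \
    --set-cc-mode=on --reset-after-cc-mode-switch
```

The reset-after-switch step matters because the CC mode change only takes effect after a GPU reset.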
root@tiger:/shared/nvtrust/host_tools/python/gpu-admin-tools# python3 nvidia_gpu_tools.py --gpu=2 --mmio-access-type sysfs
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['nvidia_gpu_tools.py', '--gpu=2', '--mmio-access-type', 'sysfs']
2024-04-05,11:44:26.708 WARNING GPU 0000:41:00.0 ? 0x2331 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
2024-04-05,11:44:26.760 WARNING GPU 0000:61:00.0 ? 0x2331 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
GPUs:
0 GPU 0000:01:00.0 A10 0x2231 BAR0 0xf8000000
1 GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000
2 GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000
Other:
Topo:
PCI 0000:00:01.1 0x1022:0x1483
GPU 0000:01:00.0 A10 0x2231 BAR0 0xf8000000
PCI 0000:40:01.1 0x1022:0x1483
GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000
PCI 0000:60:03.1 0x1022:0x1483
GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000
2024-04-05,11:44:26.760 INFO Selected GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000
2024-04-05,11:44:26.760 WARNING GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000 has CC mode on, some functionality may not work
2024-04-05,11:44:26.761 WARNING GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000 restoring power control to auto
2024-04-05,11:44:26.761 WARNING GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000 restoring power control to auto
root@tiger:/shared/nvtrust/host_tools/python/gpu-admin-tools# python3 nvidia_gpu_tools.py --gpu-name=H100 --mmio-access-type sysfs
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['nvidia_gpu_tools.py', '--gpu-name=H100', '--mmio-access-type', 'sysfs']
2024-04-05,11:43:44.596 WARNING GPU 0000:41:00.0 ? 0x2331 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
2024-04-05,11:43:44.648 WARNING GPU 0000:61:00.0 ? 0x2331 BAR0 0x0 was in D3, forced power control to on (prev auto). New state D0
AMD_SEV_DIR=/shared/AMDSEV/snp-release-2023-12-19
Topo:
PCI 0000:00:01.1 0x1022:0x1483
GPU 0000:01:00.0 A10 0x2231 BAR0 0xf8000000
PCI 0000:40:01.1 0x1022:0x1483
GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000
PCI 0000:60:03.1 0x1022:0x1483
GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000
2024-04-05,11:43:44.648 INFO Selected GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000
2024-04-05,11:43:44.648 WARNING GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000 has CC mode on, some functionality may not work
2024-04-05,11:43:44.649 WARNING GPU 0000:61:00.0 H100-PCIE 0x2331 BAR0 0x1e042000000 restoring power control to auto
2024-04-05,11:43:44.649 WARNING GPU 0000:41:00.0 H100-PCIE 0x2331 BAR0 0x26042000000 restoring power control to auto
Update 2:
GPU fails to load inside the VM:
dmesg | tail logs:
[ 556.039789] NVRM spdmStart_IMPL: SPDM: Certificate retrieval failed!
[ 556.039795] NVRM spdmStart_IMPL: SPDM: Session establishment failed!
[ 556.039807] NVRM nvCheckOkFailedNoLog: Check failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from spdmStart(pGpu, pConfCompute->pSpdm) @ conf_compute.c:238
[ 556.039812] NVRM RmInitNvDevice: *** Cannot pre-initialize the device
[ 556.039814] NVRM RmInitAdapter: RmInitNvDevice failed, bailing out of RmInitAdapter
[ 556.039835] NVOC: __nvoc_objDelete: Child class Spdm not freed from parent class ConfidentialCompute.
NVOC: __nvoc_objDelete: Child class GenericKernelFalcon not freed from parent class OBJGPU.
NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x25:899)
[ 556.748085] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 556.850711] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[ 556.850734] NVRM osInitNvMapping: *** Cannot attach gpu
[ 556.850737] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 556.850750] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[ 556.852912] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 556.961720] nvidia-uvm: Loaded the UVM driver, major device number 238.
nvcc --version output:
root@sev:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0
Driver and toolkit installed via cuda_12.2.1_535.86.10_linux.run.
Could you provide the guest boot script? And what version of Linux is the guest running (uname -a in the guest)?
If I got the output SPDM: Certificate retrieval failed!, I would hack the driver, adding some prints to find out what's going wrong.
SPDM: Certificate retrieval failed! usually means the ecdsa_generic and/or ecdh kernel modules have not been loaded.
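For the next reader, a minimal sketch of that fix (module names match the lsmod output later in this thread; the rescan path is the standard sysfs one):

```shell
# Load the kernel-crypto modules the SPDM session setup relies on,
# then rescan the PCI bus so the NVIDIA driver retries initialization.
modprobe ecdsa_generic        # ECDSA verification for the certificate chain
modprobe ecdh                 # ECDH key exchange for the SPDM session
echo 1 > /sys/bus/pci/rescan  # re-enumerate PCI devices
lsmod | grep -E 'ecdsa|ecdh'  # confirm the modules are present
```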
Here is the output of uname -a:
root@sev:~# uname -a
Linux sev 6.5.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
SPDM: Certificate retrieval failed! usually means the ecdsa_generic and/or ecdh kernel modules have not been loaded.
Okay, so I did modprobe of both ecdsa_generic and ecdh_generic and then rescanned the PCI bus. Now I can see the GPU in nvidia-smi.
The usage count for both modules is 0:
Module                  Size  Used by
ecdh_generic           16384  0
ecdsa_generic          16384  0
ecc                    45056  3 ecdh_generic,ecdsa_generic,nvidia
But when I ran nvidia-smi a second time, it failed with a page fault and a kernel crash:
[ 886.434042] BUG: unable to handle page fault for address: 00000000000152ba
[ 886.435780] #PF: supervisor read access in kernel mode
[ 886.436748] #PF: error_code(0x0000) - not-present page
[ 886.438155] PGD 0 P4D 0
[ 886.438942] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 886.439843] CPU: 24 PID: 523 Comm: nv_open_q Tainted: G OE 6.5.0-26-generic #26~22.04.1-Ubuntu
[ 886.442057] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[ 886.443964] RIP: 0010:freeRpcInfrastructure_VGPU+0x25/0xe0 [nvidia]
[ 886.445151] Code: ff eb 85 66 90 f3 0f 1e fa 55 48 89 e5 41 55 41 bd 40 00 00 00 41 54 53 48 83 ec 08 8b 87 e4 04 00 00 48 8b 1c c5 00 9a 8d c0 <80> bb ba 52 01 00 00 74 5f 45 31 ed 80 bf 83 02 00 00 00 49 89 fc
[ 886.449557] RSP: 0018:ffffa750811eba60 EFLAGS: 00010286
[ 886.451053] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
[ 886.452405] RDX: 0000000000000001 RSI: ffffa750811eba54 RDI: ffff965f868c0020
[ 886.453594] RBP: ffffa750811eba80 R08: ffffffffc0668208 R09: 0000000000000000
[ 886.455593] R10: ffffa750811eb8a8 R11: 0000000000000000 R12: 0000000000000000
[ 886.456994] R13: 0000000000000040 R14: 0000000000000000 R15: 00000000180000a1
[ 886.458996] FS: 0000000000000000(0000) GS:ffff966ebf400000(0000) knlGS:0000000000000000
[ 886.460546] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 886.461485] CR2: 00000000000152ba CR3: 00080009fda3a000 CR4: 00000000003506e0
[ 886.463525] Call Trace:
[ 886.463981] <TASK>
[ 886.464340] ? show_regs+0x6d/0x80
[ 886.465294] ? __die+0x24/0x80
[ 886.466147] ? page_fault_oops+0x99/0x1b0
[ 886.467322] ? do_user_addr_fault+0x31d/0x6b0
[ 886.468305] ? exc_page_fault+0x83/0x1b0
[ 886.469205] ? asm_exc_page_fault+0x27/0x30
[ 886.470151] ? vgpuDestructObject+0xf8/0x110 [nvidia]
[ 886.471584] ? freeRpcInfrastructure_VGPU+0x25/0xe0 [nvidia]
[ 886.472787] vgpuDestructObject+0x5b/0x110 [nvidia]
[ 886.474083] gpuDestruct_IMPL+0x377/0x3e0 [nvidia]
[ 886.475463] __nvoc_dtor_OBJGPU+0x15/0x40 [nvidia]
[ 886.476732] __nvoc_objDelete+0x2c/0xf0 [nvidia]
[ 886.477958] gpumgrAttachGpu+0x90a/0xea0 [nvidia]
[ 886.479218] RmInitAdapter+0x5ad/0x19b0 [nvidia]
[ 886.479979] ? srso_return_thunk+0x5/0x10
[ 886.480510] ? srso_return_thunk+0x5/0x10
[ 886.481050] ? srso_return_thunk+0x5/0x10
[ 886.481584] ? _raw_spin_lock_irqsave+0xe/0x20
[ 886.482166] ? srso_return_thunk+0x5/0x10
[ 886.482595] rm_init_adapter+0xad/0xc0 [nvidia]
[ 886.483322] nv_open_device+0x42b/0xa20 [nvidia]
[ 886.484022] nvidia_open_deferred+0x39/0xb0 [nvidia]
[ 886.484757] _main_loop+0x82/0x140 [nvidia]
[ 886.485377] ? __pfx__main_loop+0x10/0x10 [nvidia]
[ 886.486085] kthread+0xf2/0x120
[ 886.486522] ? __pfx_kthread+0x10/0x10
[ 886.487020] ret_from_fork+0x47/0x70
[ 886.487506] ? __pfx_kthread+0x10/0x10
[ 886.488013] ret_from_fork_asm+0x1b/0x30
[ 886.488537] </TASK>
[ 886.488829] Modules linked in: ecdh_generic ecdsa_generic binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common ppdev input_leds parport_pc sev_guest serio_raw parport nvidia_uvm(OE) mac_hid qemu_fw_cfg sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops msr reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) bochs crct10dif_pclmul drm_vram_helper crc32_pclmul polyval_clmulni drm_ttm_helper polyval_generic ghash_clmulni_intel sha256_ssse3 ttm sha1_ssse3 aesni_intel drm_kms_helper crypto_simd cryptd video ahci i2c_i801 psmouse wmi drm libahci lpc_ich i2c_smbus ecc
[ 886.497991] CR2: 00000000000152ba
[ 886.498433] ---[ end trace 0000000000000000 ]---
[ 886.507366] RIP: 0010:freeRpcInfrastructure_VGPU+0x25/0xe0 [nvidia]
[ 886.508344] Code: ff eb 85 66 90 f3 0f 1e fa 55 48 89 e5 41 55 41 bd 40 00 00 00 41 54 53 48 83 ec 08 8b 87 e4 04 00 00 48 8b 1c c5 00 9a 8d c0 <80> bb ba 52 01 00 00 74 5f 45 31 ed 80 bf 83 02 00 00 00 49 89 fc
[ 886.510762] RSP: 0018:ffffa750811eba60 EFLAGS: 00010286
[ 886.511469] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
[ 886.512428] RDX: 0000000000000001 RSI: ffffa750811eba54 RDI: ffff965f868c0020
[ 886.513336] RBP: ffffa750811eba80 R08: ffffffffc0668208 R09: 0000000000000000
[ 886.514291] R10: ffffa750811eb8a8 R11: 0000000000000000 R12: 0000000000000000
[ 886.515249] R13: 0000000000000040 R14: 0000000000000000 R15: 00000000180000a1
[ 886.516193] FS: 0000000000000000(0000) GS:ffff966ebf400000(0000) knlGS:0000000000000000
[ 886.517244] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 886.517978] CR2: 00000000000152ba CR3: 0008000128026000 CR4: 00000000003506e0
[ 886.518911] note: nv_open_q[523] exited with irqs disabled
Have you created the file /etc/modprobe.d/nvidia-lkca.conf as https://docs.nvidia.com/confidential-computing-deployment-guide.pdf says? (Section "Enabling LKCA on the Guest VM")
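From memory of that guide, the file makes modprobe pull in the LKCA crypto modules before the nvidia module loads; a sketch (verify the exact line against the PDF):

```shell
# Assumed content per the deployment guide's "Enabling LKCA" section:
cat <<'EOF' > /etc/modprobe.d/nvidia-lkca.conf
install nvidia /sbin/modprobe ecdsa_generic ecdh; /sbin/modprobe --ignore-install nvidia
EOF
update-initramfs -u   # rebuild the initramfs so the rule takes effect at boot
```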
I just tried that, but I see the same pattern: nvidia-smi works the first time and then the kernel crashes with a page fault, as in my previous comment.
Should I downgrade the driver version? The current setup is Driver Version: 550.54.14, CUDA Version: 12.4, and I used the 550 server-open-kernel variant.
I suggest using 535.86.10.
I have now installed the driver version you mentioned (Driver Version: 535.86.10, CUDA Version: 12.2). This time the kernel is not crashing, but nvidia-smi fails the second time:
[ 10.543945] nvidia: loading out-of-tree module taints kernel.
[ 10.543960] nvidia: module license 'NVIDIA' taints kernel.
[ 10.543962] Disabling lock debugging due to kernel taint
[ 10.543966] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 10.543967] nvidia: module license taints kernel.
[ 10.581561] EXT4-fs (sda2): mounted filesystem 36c949ba-15e8-49bc-b19f-437c7b695fa0 r/w with ordered data mode. Quota mode: none.
[ 10.641258] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[ 10.641268] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.86.10 Wed Jul 26 23:20:03 UTC 2023
[ 10.665060] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.86.10 Wed Jul 26 23:01:50 UTC 2023
<skipping few logs>
[ 147.672840] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:762)
[ 147.675391] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 147.757253] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:762)
[ 147.759663] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Okay, I think it's fine now. I reinstalled the driver with -m=kernel-open, then enabled persistence mode after reboot. With that, the driver is no longer failing.
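For anyone landing here, the final working sequence was roughly the following (a hedged recap, not a verified script: the installer filename is the one quoted earlier in this thread, and persistence can alternatively be managed by nvidia-persistenced):

```shell
# Reinstall the driver using the open kernel modules (required for CC),
# then keep the driver resident so state survives between client processes.
sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open
nvidia-smi -pm 1   # enable persistence mode
```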
System background: A5000 + 2x H100 GPUs, AMD EPYC 7643 48-Core Processor.
lspci -vd 10de: --> output (collapsed)
lscpu: --> output (collapsed)