NVIDIA / nvtrust

Ancillary open source software to support confidential computing on NVIDIA GPUs
Apache License 2.0

Nvidia driver won't load - nvidia tdx #72

Open YurkoWasHere opened 1 week ago

YurkoWasHere commented 1 week ago
root@(none):/# modprobe nvidia
[  729.263450] nvidia: loading out-of-tree module taints kernel.
[  729.268371] ------------[ cut here ]------------
[  729.269380] WARNING: CPU: 3 PID: 1958 at arch/x86/kernel/alternative.c:654 apply_retpolines+0x379/0x570
[  729.271337] Modules linked in: vmw_vsock_virtio_transport(E) vhost_vsock(E) vmw_vsock_virtio_transport_common(E) vsock(E) vhost(E) vhost_iotlb(E) dm_integrity(E) dm_bufio(E) async_xor(E) async_tx(E) xor(E) essiv(E) authenc(E) dm_crypt(E) tdx_guest(E) tsm(E)
[  729.275938] CPU: 3 PID: 1958 Comm: modprobe Tainted: G S         OE      6.9.0-rc7+ #1
[  729.277603] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[  729.279240] RIP: 0010:apply_retpolines+0x379/0x570
[  729.280260] Code: 40 80 fe e9 0f 85 96 00 00 00 39 d1 0f 8d 40 ff ff ff 8d 71 01 48 63 c9 c6 44 0d c0 cc e9 51 ff ff ff 41 b8 e0 ff ff ff eb bb <0f> 0b e9 6b fd ff ff 0f 0b 0f b6 8d 69 ff ff ff 89 ce 83 e6 f0 40
[  729.284111] RSP: 0018:ffffa18c03b2bb18 EFLAGS: 00010216
[  729.285225] RAX: 0000000000c0156b RBX: ffffffffc03366ed RCX: 0000000000000005
[  729.286717] RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000000
[  729.288244] RBP: ffffa18c03b2bbf0 R08: 0000000000000000 R09: 0000000000000000
[  729.289778] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  729.291310] R13: ffffa18c03b2bb40 R14: ffffffffc0b70b6a R15: ffffffffc0b70af6
[  729.292820] FS:  00007bcfc7115c40(0000) GS:ffff8d213fb80000(0000) knlGS:0000000000000000
[  729.294521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  729.295740] CR2: 00007bcfc55a1ff0 CR3: 00000005fce96001 CR4: 00000000007706f0
[  729.297243] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  729.298724] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  729.300252] PKRU: 55555554
[  729.300848] Call Trace:
[  729.301403]  <TASK>
[  729.301882]  ? show_regs+0x6c/0x80
[  729.302620]  ? __warn+0x88/0x140
[  729.303337]  ? apply_retpolines+0x379/0x570
[  729.304223]  ? report_bug+0x182/0x1b0
[  729.305023]  ? handle_bug+0x51/0xa0
[  729.305781]  ? exc_invalid_op+0x18/0x80
[  729.306603]  ? asm_exc_invalid_op+0x1b/0x20
[  729.307534]  ? apply_retpolines+0x379/0x570
[  729.308455]  ? apply_retpolines+0x125/0x570
[  729.309343]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[  729.310437]  module_finalize+0x1b9/0x330
[  729.311293]  ? add_kallsyms+0x2bd/0x350
[  729.312120]  load_module+0x1b8a/0x2320
[  729.312991]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[  729.314135]  __do_sys_init_module+0x1a7/0x1e0
[  729.315111]  ? __do_sys_init_module+0x1a7/0x1e0
[  729.316112]  __x64_sys_init_module+0x1a/0x30
[  729.317078]  x64_sys_call+0x14aa/0x25c0
[  729.317943]  do_syscall_64+0x7e/0x180
[  729.318732]  ? irqentry_exit+0x43/0x50
[  729.319553]  ? exc_page_fault+0x93/0x1b0
[  729.320394]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  729.321476] RIP: 0033:0x7bcfc6926bde
[  729.322255] Code: 48 8b 0d 55 32 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 22 32 0f 00 f7 d8 64 89 01 48
[  729.326087] RSP: 002b:00007ffe18363928 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[  729.327701] RAX: ffffffffffffffda RBX: 000058f448bf28c0 RCX: 00007bcfc6926bde
[  729.329248] RDX: 000058f41395ccd2 RSI: 00000000017a4ff1 RDI: 00007bcfc3e00010
[  729.330769] RBP: 00007bcfc3e00010 R08: 00007bcfc70ed000 R09: 85ebca77c2b2ae63
[  729.332297] R10: 0000000000001fff R11: 0000000000000246 R12: 000058f41395ccd2
[  729.333809] R13: 000058f448bf2940 R14: 000058f448bfa4b0 R15: 000058f448bf29f0
[  729.335307]  </TASK>
[  729.335786] ---[ end trace 0000000000000000 ]---
[  729.399521] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[  729.401216] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  550.54.15  Release Build  (root@vcomp-01)  Wed Oct 16 15:09:02 UTC 2024
root@(none):/# lsmod

Module                  Size  Used by
nvidia               8671232  0
vmw_vsock_virtio_transport    20480  0
vhost_vsock            24576  0
vmw_vsock_virtio_transport_common    57344  2 vhost_vsock,vmw_vsock_virtio_transport
vsock                  61440  3 vmw_vsock_virtio_transport_common,vhost_vsock,vmw_vsock_virtio_transport
vhost                  65536  1 vhost_vsock
vhost_iotlb            16384  1 vhost
dm_integrity           81920  1
dm_bufio               57344  1 dm_integrity
async_xor              16384  1 dm_integrity
async_tx               16384  1 async_xor
xor                    20480  1 async_xor
essiv                  16384  0
authenc                12288  2 essiv
dm_crypt               65536  2
tdx_guest              12288  0
tsm                    16384  1 tdx_guest
root@(none):/#
$ nvidia-persistenced
[  795.927271] ACPI Warning: \_SB.PCI0.S20.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)

[  799.260398] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[  799.261730] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[  799.292281] NVRM: nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from pRmApi->Control(pRmApi, pGpu->hInternalClient, pGpu->hInternalSubdevice, NV2080_CTRL_CMD_INTERNAL_MEMMGR_MEMORY_TRANSFER_WITH_GSP, &gspParams, sizeof(gspParams)) @ mem_utils.c:283
[  799.297194] NVRM: nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from _memmgrMemReadOrWriteWithGsp(pGpu, pDstInfo, pBuf, size, NV_FALSE ) @ mem_utils.c:864
[  799.300531] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ mem_mgr.c:607
[  799.302046] NVRM: nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from memmgrVerifyGspDmaOps(pGpu, GPU_GET_MEMORY_MANAGER(pGpu)) @ kern_bus_gm107.c:385
[  799.305279] NVRM: RmInitNvDevice: *** Cannot initialize the device
[  799.306416] NVRM: RmInitAdapter: RmInitNvDevice failed, bailing out of RmInitAdapter
[  799.307816] NVRM: rmapiReportLeakedDevices: Device object leak: (0xc1e00004, 0xcaf00000). Please file a bug against RM-core.
[  799.309808] NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ rmapi.c:961
[  799.944604] NVOC: __nvoc_objDelete: Child class PrereqTracker not freed from parent class OBJGPU.NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x25:1025)
[  799.949341] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  800.134903] NVRM: gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[  800.136544] NVRM: nvAssertFailedNoLog: Assertion failed: GPU_GET_VGPU(pGpu) != NULL @ gpu.c:5282
[  800.138235] NVRM: nvAssertFailedNoLog: Assertion failed: pRpc != NULL @ objvgpu.c:150
[  800.139697] BUG: unable to handle page fault for address: 00000000000152ba
[  800.141069] #PF: supervisor read access in kernel mode
[  800.142114] #PF: error_code(0x0000) - not-present page
[  800.143086] PGD 6cc855067 P4D 6cc855067 PUD 6f8f74067 PMD 0
[  800.144245] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  800.145115] CPU: 1 PID: 1965 Comm: nvidia-persiste Tainted: G S      W  OE      6.9.0-rc7+ #1
[  800.146837] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[  800.148383] RIP: 0010:freeRpcInfrastructure_VGPU+0x25/0xf0 [nvidia]
[  800.149919] Code: 00 00 00 00 00 f3 0f 1e fa 55 48 89 e5 41 55 41 bd 40 00 00 00 41 54 53 48 83 ec 08 8b 87 e4 04 00 00 48 8b 1c c5 40 fe 6d c0 <80> bb ba 52 01 00 00 74 53 31 d2 48 89 de 49 89 fc e8 e5 cc ff ff
[  800.153351] RSP: 0018:ffffa18c03b1b590 EFLAGS: 00010286
[  800.154357] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
[  800.155723] RDX: 0000000000000001 RSI: ffffa18c03b1b584 RDI: ffff8d120277c020
[  800.157084] RBP: ffffa18c03b1b5b0 R08: ffffa18c03b1b3c0 R09: 0000000000000000
[  800.158501] R10: ffffa18c03b1b3f8 R11: ffffffffc0473ac8 R12: 0000000000000000
[  800.159895] R13: 0000000000000040 R14: ffff8d120277c020 R15: 0000000000000056
[  800.161233] FS:  00007e1f051ab740(0000) GS:ffff8d213fa80000(0000) knlGS:0000000000000000
[  800.162862] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  800.163995] CR2: 00000000000152ba CR3: 00000006ff43a002 CR4: 00000000007706f0
[  800.165375] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  800.166768] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  800.168133] PKRU: 55555554
[  800.168646] Call Trace:
[  800.169123]  <TASK>
[  800.169563]  ? show_regs+0x6c/0x80
[  800.170214]  ? __die+0x24/0x80
[  800.170839]  ? page_fault_oops+0x175/0x5a0
[  800.171614]  ? do_user_addr_fault+0x306/0x6f0
[  800.172449]  ? rcdbAddAssertJournalRecWithLine+0x1f3/0x260 [nvidia]
[  800.173916]  ? exc_page_fault+0x82/0x1b0
[  800.174663]  ? asm_exc_page_fault+0x27/0x30
[  800.175522]  ? vgpuDestructObject+0xf8/0x110 [nvidia]
[  800.176740]  ? freeRpcInfrastructure_VGPU+0x25/0xf0 [nvidia]
[  800.177985]  vgpuDestructObject+0x5b/0x110 [nvidia]
[  800.179080]  gpuDestruct_IMPL+0x377/0x3e0 [nvidia]
[  800.180196]  __nvoc_dtor_OBJGPU+0x15/0x40 [nvidia]
[  800.181281]  __nvoc_objDelete+0x2c/0xf0 [nvidia]
[  800.182321]  gpumgrAttachGpu+0xabc/0xe40 [nvidia]
[  800.183391]  RmInitAdapter+0x467/0x19d0 [nvidia]
[  800.184448]  ? os_alloc_mem+0xd5/0xf0 [nvidia]
[  800.185516]  ? os_get_current_tick+0x3c/0xb0 [nvidia]
[  800.186642]  ? _raw_spin_unlock_irqrestore+0x11/0x60
[  800.187642]  rm_init_adapter+0xa5/0xb0 [nvidia]
[  800.188740]  nv_open_device+0x200/0xa30 [nvidia]
[  800.189821]  nvidia_open+0x228/0x510 [nvidia]
[  800.190761]  chrdev_open+0xcf/0x250
[  800.191460]  ? __pfx_chrdev_open+0x10/0x10
[  800.192229]  do_dentry_open+0x21a/0x570
[  800.192954]  vfs_open+0x33/0x50
[  800.193556]  path_openat+0xaf8/0x1170
[  800.194297]  ? __slab_free+0xdf/0x2a0
[  800.195035]  do_filp_open+0xc0/0x170
[  800.195705]  do_sys_openat2+0xb3/0xe0
[  800.196382]  __x64_sys_openat+0x55/0xa0
[  800.197177]  x64_sys_call+0x1eb8/0x25c0
[  800.197995]  do_syscall_64+0x7e/0x180
[  800.198734]  ? syscall_exit_to_user_mode+0x81/0x270
[  800.199719]  ? do_syscall_64+0x8b/0x180
[  800.200530]  ? close_pdeo+0x111/0x130
[  800.201251]  ? __fput+0x15e/0x2e0
[  800.201932]  ? syscall_exit_to_user_mode+0x81/0x270
[  800.202929]  ? do_syscall_64+0x8b/0x180
[  800.203691]  ? do_syscall_64+0x8b/0x180
[  800.204469]  ? syscall_exit_to_user_mode+0x81/0x270
[  800.205452]  ? do_syscall_64+0x8b/0x180
[  800.206231]  ? exc_page_fault+0x93/0x1b0
[  800.206976]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  800.207948] RIP: 0033:0x7e1f04f1453b
[  800.208627] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[  800.212185] RSP: 002b:00007ffe97acf2c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  800.213667] RAX: ffffffffffffffda RBX: 00007ffe97acf360 RCX: 00007e1f04f1453b
[  800.215085] RDX: 0000000000000002 RSI: 00007ffe97acf360 RDI: 00000000ffffff9c
[  800.216462] RBP: 00007ffe97acf360 R08: 0000000000000000 R09: 00007ffe97acf1c0
[  800.217768] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[  800.219189] R13: 00007ffe97acf4ac R14: 0000000000000002 R15: 00007e1f04a611a0
[  800.220508]  </TASK>
[  800.220981] Modules linked in: nvidia(O) vmw_vsock_virtio_transport(E) vhost_vsock(E) vmw_vsock_virtio_transport_common(E) vsock(E) vhost(E) vhost_iotlb(E) dm_integrity(E) dm_bufio(E) async_xor(E) async_tx(E) xor(E) essiv(E) authenc(E) dm_crypt(E) tdx_guest(E) tsm(E)
[  800.225333] CR2: 00000000000152ba
[  800.225976] ---[ end trace 0000000000000000 ]---
[  800.226841] RIP: 0010:freeRpcInfrastructure_VGPU+0x25/0xf0 [nvidia]
[  800.228277] Code: 00 00 00 00 00 f3 0f 1e fa 55 48 89 e5 41 55 41 bd 40 00 00 00 41 54 53 48 83 ec 08 8b 87 e4 04 00 00 48 8b 1c c5 40 fe 6d c0 <80> bb ba 52 01 00 00 74 53 31 d2 48 89 de 49 89 fc e8 e5 cc ff ff
[  800.231855] RSP: 0018:ffffa18c03b1b590 EFLAGS: 00010286
[  800.232855] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
[  800.234278] RDX: 0000000000000001 RSI: ffffa18c03b1b584 RDI: ffff8d120277c020
[  800.235588] RBP: ffffa18c03b1b5b0 R08: ffffa18c03b1b3c0 R09: 0000000000000000
[  800.236957] R10: ffffa18c03b1b3f8 R11: ffffffffc0473ac8 R12: 0000000000000000
[  800.238386] R13: 0000000000000040 R14: ffff8d120277c020 R15: 0000000000000056
[  800.239727] FS:  00007e1f051ab740(0000) GS:ffff8d213fa80000(0000) knlGS:0000000000000000
[  800.241254] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  800.242355] CR2: 00000000000152ba CR3: 00000006ff43a002 CR4: 00000000007706f0
[  800.243708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  800.245065] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  800.246417] PKRU: 55555554
[  800.246941] note: nvidia-persiste[1965] exited with irqs disabled
[  800.248258] NVRM: nvAssertFailedNoLog: Assertion failed: !rmapiLockIsOwner() @ rmapi.c:623
[  800.249798] NVRM: nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ thread_state.c:845
nvidia-persistenced failed to initialize. Check syslog for more details.
root@(none):/#
root@(none):/#
Tan-YiFan commented 1 week ago

NV_ERR_INVALID_DATA is possibly caused by the guest kernel lacking the aesni_intel kernel module (it is absent from the lsmod output shown above).
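
For anyone who wants to run the same check on their own guest, here is a rough sketch that scans `lsmod` output for a list of required modules. The helper names `loaded_modules` and `missing_modules` are made up for illustration, and the sample text is a shortened excerpt of the lsmod output posted above. Treat it as a hint only: a driver compiled directly into the kernel does not appear in `lsmod`, so a "missing" result does not prove AES-NI support is unavailable.

```python
# Sketch only: helper names are hypothetical, not part of nvtrust or the NVIDIA driver.

# Shortened excerpt of the lsmod output from the first comment in this thread.
SAMPLE = """\
Module                  Size  Used by
nvidia               8671232  0
dm_crypt               65536  2
tdx_guest              12288  0
tsm                    16384  1 tdx_guest
"""


def loaded_modules(lsmod_text):
    """Return the set of module names listed in `lsmod` output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header row.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def missing_modules(lsmod_text, required=("aesni_intel",)):
    """Return the required module names that are not listed as loaded."""
    present = loaded_modules(lsmod_text)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    print("missing:", missing_modules(SAMPLE))
```

On a live guest, the text passed in would be the output of `lsmod` itself rather than the sample string.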

YurkoWasHere commented 1 week ago

I rebuilt using the stock Ubuntu kernel and it seems to work.

I get a weird error about the backlight, but it attests and nvidia-smi picks up the H100, so I think we're good?

[  313.990837] cryptd: max_cpu_qlen set to 1000
[  314.013713] SSE version of gcm_enc/dec engaged.
[  314.044416] nvidia: loading out-of-tree module taints kernel.
[  314.047850] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[  314.145657] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[  314.147133] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  550.54.15  Release Build  (root)  Thu Oct 17 21:25:54 UTC 2024
[  314.277332] nvidia-uvm: Loaded the UVM driver, major device number 237.
[  314.543329] ACPI Warning: \_SB.PCI0.S20.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
[  317.816613] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[  317.818042] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:1967
[  318.352323] NVRM: nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from pEntry->pCallback(pGpu, pEntry->pCallbackParam) @ kernel_fifo.c:3021
[  318.380366] nvidia_modeset: Unknown symbol __acpi_video_get_backlight_type (err -2)
Tan-YiFan commented 6 days ago

These error logs are expected.