NVIDIA / nvtrust

Ancillary open source software to support confidential computing on NVIDIA GPUs
Apache License 2.0

Cannot turn off GPU confidential computing #58

Closed hedj17 closed 4 months ago

hedj17 commented 5 months ago

When launching a CVM, I get the following warning: qemu-system-x86_64: -device vfio-pci,host=99:00.0,bus=pci.1: warning: vfio_register_ram_discard_listener: possibly running out of DMA mappings. E.g., try increasing the 'block-size' of virtio-mem devices. Maximum possible DMA mappings: 1048576, Maximum possible memslots: 32764
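The warning itself points at the virtio-mem 'block-size' property: if the CVM uses a virtio-mem device, a larger block size means fewer DMA mappings for QEMU to track. A hypothetical QEMU command-line fragment, with placeholder ids and sizes not taken from this setup:

-object memory-backend-ram,id=vmem0,size=16G
-device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=8G,block-size=2M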

I also enabled UVM persistence mode for the NVIDIA device by changing /usr/lib/systemd/system/nvidia-persistenced.service to: ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose
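For reference, a minimal sketch of applying that change inside the guest, assuming the stock unit path above (a drop-in via systemctl edit works just as well):

sudo sed -i 's|^ExecStart=.*|ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose|' /usr/lib/systemd/system/nvidia-persistenced.service
sudo systemctl daemon-reload
sudo systemctl restart nvidia-persistenced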

But when I run nvidia-smi in the CVM, I get an error: No devices were found

When I run dmesg, I see:

[   17.562504] audit: type=1400 audit(1718457193.631:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=769 comm="apparmor_parser"
[   17.562563] audit: type=1400 audit(1718457193.631:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=770 comm="apparmor_parser"
[   17.562568] audit: type=1400 audit(1718457193.631:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=770 comm="apparmor_parser"
[   17.562956] audit: type=1400 audit(1718457193.631:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=774 comm="apparmor_parser"
[   17.562959] audit: type=1400 audit(1718457193.631:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=774 comm="apparmor_parser"
[   17.562962] audit: type=1400 audit(1718457193.631:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=774 comm="apparmor_parser"
[   17.563072] audit: type=1400 audit(1718457193.631:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=776 comm="apparmor_parser"
[   17.563089] audit: type=1400 audit(1718457193.631:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=776 comm="apparmor_parser"
[   17.565954] audit: type=1400 audit(1718457193.635:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="ubuntu_pro_apt_news" pid=772 comm="apparmor_parser"
[   17.566315] audit: type=1400 audit(1718457193.635:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="tcpdump" pid=775 comm="apparmor_parser"
[   18.703784] ACPI Warning: \_SB.PCI0.S08.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20221020/nsarguments-61)
[   20.719551] loop3: detected capacity change from 0 to 8
[   20.793675] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x8f0334,  regvalue: 0xbadf5108,  error code: Unknown SYS_PRI_ERROR_CODE
[   24.792233] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d08be13c78000 >= 3d08be13c78000
[   24.792240] NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[   24.792248] NVRM: kgspBootstrap_GH100: Timeout waiting for GSP target mask release. This error may be caused by several reasons: Bootrom may have failed, GSP init code may have failed or ACR failed to release target mask. RM does not have access to information on which of those conditions happened.
[   24.792254] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x8f0334,  regvalue: 0xbadf5108,  error code: Unknown SYS_PRI_ERROR_CODE
[   24.792256] NVRM: kfspDumpDebugState_GH100: FSP microcode v1.8
[   24.792257] NVRM: kfspDumpDebugState_GH100: GPU 0000:01:00
[   24.792260] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x0
[   24.792262] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x0
[   24.792263] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x0
[   24.792265] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0x0
[   24.792888] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x110804,  regvalue: 0xbadf4100,  error code: Unknown SYS_PRI_ERROR_CODE
[   24.792892] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[   26.236374] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1784)
[   26.238281] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   28.134753] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x8f0334,  regvalue: 0xbadf5108,  error code: Unknown SYS_PRI_ERROR_CODE
[   32.148237] NVRM: kfspPollForResponse_IMPL: FSP command timed out
[   32.148241] NVRM: kfspSendBootCommands_GH100: Sent following content to FSP:
[   32.148243] NVRM: kfspSendBootCommands_GH100: version=0x1, size=0x35c, gspFmcSysmemOffset=0x12a940000
[   32.148245] NVRM: kfspSendBootCommands_GH100: frtsSysmemOffset=0x0, frtsSysmemSize=0x0
[   32.148246] NVRM: kfspSendBootCommands_GH100: frtsVidmemOffset=0x200000, frtsVidmemSize=0x100000
[   32.148247] NVRM: kfspSendBootCommands_GH100: gspBootArgsSysmemOffset=0x12c46d000
[   32.148248] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
[   32.148252] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x8f0334,  regvalue: 0xbadf5108,  error code: Unknown SYS_PRI_ERROR_CODE
[   32.148254] NVRM: kfspDumpDebugState_GH100: FSP microcode v1.8
[   32.148255] NVRM: kfspDumpDebugState_GH100: GPU 0000:01:00
[   32.148258] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x40
[   32.148259] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x110418
[   32.148261] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x1103c0
[   32.148264] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x8f032c,  regvalue: 0xbadf57eb,  error code: Unknown SYS_PRI_ERROR_CODE
[   32.148265] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0xbadf57eb
[   32.149286] NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kfspSendBootCommands_HAL(pGpu, pKernelFsp) @ kernel_gsp_gh100.c:756
[   32.149599] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x110804,  regvalue: 0xbadf4100,  error code: Unknown SYS_PRI_ERROR_CODE
[   32.149602] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[   33.604174] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1784)
[   33.606553] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   33.646854] show_signal_msg: 29 callbacks suppressed
[   33.646859] nvidia-persiste[827]: segfault at 44 ip 00007fc0c5608c11 sp 00007ffe58f26470 error 6 in libnvidia-cfg.so.550.54.15[7fc0c5600000+4d000] likely on CPU 19 (core 19, socket 0)
[   33.646886] Code: 00 31 c0 48 81 c4 10 08 00 00 5b 5d 41 5c 41 5d 41 5e c3 66 0f 1f 44 00 00 41 55 41 54 48 8d 57 48 55 53 48 89 fb 48 83 ec 28 <c7> 47 44 00 00 00 00 8b 77 08 8b 3f e8 8e fe ff ff 85 c0 89 c5 75
[  441.512859] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x8f0334,  regvalue: 0xbadf5108,  error code: Unknown SYS_PRI_ERROR_CODE
[  445.525805] NVRM: kfspPollForResponse_IMPL: FSP command timed out
[  445.525810] NVRM: kfspSendBootCommands_GH100: Sent following content to FSP:
[  445.525812] NVRM: kfspSendBootCommands_GH100: version=0x1, size=0x35c, gspFmcSysmemOffset=0x125380000
[  445.525813] NVRM: kfspSendBootCommands_GH100: frtsSysmemOffset=0x0, frtsSysmemSize=0x0
[  445.525815] NVRM: kfspSendBootCommands_GH100: frtsVidmemOffset=0x200000, frtsVidmemSize=0x100000
[  445.525816] NVRM: kfspSendBootCommands_GH100: gspBootArgsSysmemOffset=0x11e5d4000
[  445.525817] NVRM: kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
[  445.525821] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x8f0334,  regvalue: 0xbadf5108,  error code: Unknown SYS_PRI_ERROR_CODE
[  445.525822] NVRM: kfspDumpDebugState_GH100: FSP microcode v1.8
[  445.525823] NVRM: kfspDumpDebugState_GH100: GPU 0000:01:00
[  445.525827] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x40
[  445.525828] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x110418
[  445.525830] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x1103c0
[  445.525833] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x8f032c,  regvalue: 0xbadf57eb,  error code: Unknown SYS_PRI_ERROR_CODE
[  445.525835] NVRM: kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0xbadf57eb
[  445.526882] NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kfspSendBootCommands_HAL(pGpu, pKernelFsp) @ kernel_gsp_gh100.c:756
[  445.527195] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x110804,  regvalue: 0xbadf4100,  error code: Unknown SYS_PRI_ERROR_CODE
[  445.527199] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[  446.974944] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1784)
[  446.977094] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

When I try to turn off confidential computing on the H100 in order to reinitialize it, I get this error:

2024-06-15,21:28:02.653 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_54 0x14d8 = 0x0
2024-06-15,21:28:02.653 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_55 0x14dc = 0x0
2024-06-15,21:28:02.653 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_56 0x14e0 = 0x0
2024-06-15,21:28:02.653 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_57 0x14e4 = 0x0
2024-06-15,21:28:02.653 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_58 0x14e8 = 0x0
2024-06-15,21:28:02.653 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_59 0x14ec = 0x0
2024-06-15,21:28:02.653 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_60 0x14f0 = 0x0
2024-06-15,21:28:02.653 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_61 0x14f4 = 0x0
2024-06-15,21:28:02.653 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_62 0x14f8 = 0x0
2024-06-15,21:28:02.654 INFO     GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 vbios_scratch_63 0x14fc = 0x0
Traceback (most recent call last):
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 5086, in main
    gpu.set_cc_mode(opts.set_cc_mode)
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 4015, in set_cc_mode
    self.fsp_rpc.prc_knob_check_and_write(PrcKnob.PRC_KNOB_ID_CCD.value, cc_dev_mode)
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3350, in prc_knob_check_and_write
    self.prc_knob_write(knob_id, value)
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3341, in prc_knob_write
    data = self.prc_cmd([prc, prc_1])
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3240, in prc_cmd
    self.poll_for_msg_queue()
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3200, in poll_for_msg_queue
    raise GpuError(f"Timed out polling for {self.falcon.name} message queue on channel {self.channel_num}. head {mhead} == tail {mtail}")
__main__.GpuError: Timed out polling for fsp message queue on channel 2. head 2048 == tail 2048

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 5289, in <module>
    main()
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 5091, in main
    prc_knobs = gpu.query_prc_knobs()
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 4049, in query_prc_knobs
    knob_value = self.fsp_rpc.prc_knob_read(knob.value)
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3318, in prc_knob_read
    data = self.prc_cmd([prc])
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3227, in prc_cmd
    self.poll_for_queue_empty()
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3214, in poll_for_queue_empty
    raise GpuError(f"Timed out polling for {self.falcon.name} cmd queue to be empty on channel {self.channel_num}. head {mhead} != tail {mtail}")
__main__.GpuError: Timed out polling for fsp cmd queue to be empty on channel 2. head 2048 != tail 2060
2024-06-15,21:28:03.762 WARNING  GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 restoring power control to auto

How can I fix these problems?

Tan-YiFan commented 5 months ago

The same "running out of DMA mappings" warning was also reported in https://github.com/NVIDIA/nvtrust/issues/46; however, that issue seems unresolved.

You should power off the VM before changing the CC state.
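A sketch of the usual host-side sequence, assuming the gpu-admin-tools checkout path used elsewhere in this thread and the flags documented in that repo:

# shut down the CVM first (however it was launched, e.g. virsh shutdown or stopping the QEMU process)
cd /shared/gpu-admin-tools
sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=99:00.0 --query-cc-mode
sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=99:00.0 --set-cc-mode=off --reset-after-cc-mode-switch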

hedj17 commented 5 months ago

@Tan-YiFan Yes, I have powered off the VM, but I still can't change the CC state. Should I reflash the GPU firmware?

Tan-YiFan commented 5 months ago

Did you try the other commands of nvidia_gpu_tools.py listed in https://github.com/NVIDIA/gpu-admin-tools?

If a GPU is in CC mode (on or devtools), the driver on the host is not able to communicate with it, and rebooting the host machine will panic at modprobe. In that case, please blacklist the nvidia driver on the host via the kernel command line.
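A minimal sketch of that blacklisting on an Ubuntu host using GRUB:

sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&modprobe.blacklist=nvidia,nvidia_drm,nvidia_modeset,nvidia_uvm /' /etc/default/grub
sudo update-grub
sudo reboot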

hedj17 commented 5 months ago

@Tan-YiFan When I run other commands of nvidia_gpu_tools.py, I always get:

 NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu-bdf=99:00.0', '--reset-with-sbr']
Topo:
  Intel root port 0000:98:01.0
   GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000
2024-06-17,16:12:32.502 INFO     Selected GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000
2024-06-17,16:12:32.502 WARNING  GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 has CC mode devtools, some functionality may not work

However, when I ran the command with "--recover-broken-gpu", I got this error:

NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu-bdf=99:00.0', '--recover-broken-gpu']
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3671, in __init__
    raise BrokenGpuError()
2024-06-17,16:20:00.915 ERROR    GPU /sys/bus/pci/devices/0000:99:00.0 broken:
2024-06-17,16:20:00.917 ERROR    Config space working True
Topo:
  Intel root port 0000:98:01.0
   GPU 0000:99:00.0 ? 0x2331 BAR0 0xce042000000
   GPU 0000:99:00.0 [broken, cfg space working 1 bars configured 1]
2024-06-17,16:20:00.917 INFO     Selected GPU 0000:99:00.0 [broken, cfg space working 1 bars configured 1]
2024-06-17,16:20:01.350 ERROR    Config space working True
2024-06-17,16:20:01.350 ERROR    Failed to recover GPU 0000:99:00.0 [broken, cfg space working 1 bars configured 1]

I guessed it was related to modprobe and rebooting the host machine, but I have already uninstalled the NVIDIA driver, and when I ran "lspci -d 10de: -k", I got:

99:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
        Subsystem: NVIDIA Corporation Device 1626
        Kernel modules: nvidiafb, nouveau

Should I still blacklist the nvidia driver, or could this be related to nouveau?
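For what it's worth, nouveau and nvidiafb can be kept away from the device with a modprobe.d blacklist; a minimal sketch assuming the usual Ubuntu layout:

printf 'blacklist nouveau\nblacklist nvidiafb\noptions nouveau modeset=0\n' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo reboot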

hedj17 commented 5 months ago

When I checked the kernel messages on the host machine, I saw the following:

[   36.839018] nouveau 0000:99:00.0: unknown chipset (180000a1)
[   45.118109] audit: type=1400 audit(1718611588.572:63): apparmor="DENIED" operation="capable" class="cap" profile="/usr/sbin/cupsd" pid=1345 comm="cupsd" capability=12  capname="net_admin"
[   45.943537] loop12: detected capacity change from 0 to 8
[  105.148020] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[  114.416830] audit: type=1400 audit(1718611657.771:64): apparmor="DENIED" operation="capable" class="cap" profile="/snap/snapd/21759/usr/lib/snapd/snap-confine" pid=1730 comm="snap-confine" capability=12  capname="net_admin"
[  114.416851] audit: type=1400 audit(1718611657.771:65): apparmor="DENIED" operation="capable" class="cap" profile="/snap/snapd/21759/usr/lib/snapd/snap-confine" pid=1730 comm="snap-confine" capability=38  capname="perfmon"

Tan-YiFan commented 5 months ago

Please try "modprobe -r" on all nvidia-related modules for this GPU and then bind it to vfio.

Please also confirm that the VBIOS version meets the requirement.
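A minimal sketch of both steps on the host (the BDF is the one from this thread; the module list assumes the standard NVIDIA driver stack, and the VBIOS query only works while the NVIDIA driver is bound and healthy):

# query the VBIOS version while the NVIDIA driver still owns the GPU
nvidia-smi -q | grep -i vbios
# unload NVIDIA modules and hand the GPU to vfio-pci
sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe vfio-pci
echo "0000:99:00.0" | sudo tee /sys/bus/pci/devices/0000:99:00.0/driver/unbind   # only if a driver is currently bound
echo "vfio-pci" | sudo tee /sys/bus/pci/devices/0000:99:00.0/driver_override
echo "0000:99:00.0" | sudo tee /sys/bus/pci/drivers_probe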

hedj17 commented 5 months ago

@Tan-YiFan I have reinstalled the NVIDIA driver on the host, but nvidia-smi on the host still reports "No devices were found". How can I query the VBIOS version? I have also tried "modprobe -r" on all nvidia-related modules for this GPU and bound it to vfio, but I got the same error when changing the CC mode:

Traceback (most recent call last):
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 5086, in main
    gpu.set_cc_mode(opts.set_cc_mode)
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 4015, in set_cc_mode
    self.fsp_rpc.prc_knob_check_and_write(PrcKnob.PRC_KNOB_ID_CCD.value, cc_dev_mode)
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3350, in prc_knob_check_and_write
    self.prc_knob_write(knob_id, value)
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3341, in prc_knob_write
    data = self.prc_cmd([prc, prc_1])
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3240, in prc_cmd
    self.poll_for_msg_queue()
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3200, in poll_for_msg_queue
    raise GpuError(f"Timed out polling for {self.falcon.name} message queue on channel {self.channel_num}. head {mhead} == tail {mtail}")
__main__.GpuError: Timed out polling for fsp message queue on channel 2. head 2048 == tail 2048

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 5289, in <module>
    main()
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 5091, in main
    prc_knobs = gpu.query_prc_knobs()
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 4049, in query_prc_knobs
    knob_value = self.fsp_rpc.prc_knob_read(knob.value)
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3318, in prc_knob_read
    data = self.prc_cmd([prc])
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3227, in prc_cmd
    self.poll_for_queue_empty()
  File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3214, in poll_for_queue_empty
    raise GpuError(f"Timed out polling for {self.falcon.name} cmd queue to be empty on channel {self.channel_num}. head {mhead} != tail {mtail}")
__main__.GpuError: Timed out polling for fsp cmd queue to be empty on channel 2. head 2048 != tail 2060

When I checked the kernel messages on the host machine, I saw the following:

[  447.195093] vfio-pci 0000:99:00.0: Enabling HDA controller
[  505.280698] vfio-pci 0000:99:00.0: Enabling HDA controller
[  505.519841] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x19@0x100
[  505.519853] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x24@0x140
[  505.519856] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x25@0x14c
[  505.519858] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x26@0x158
[  505.519860] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x2a@0x188
[  505.519868] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x27@0x200
[  505.519887] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x2e@0x2c8
[  505.529397] vfio-pci 0000:99:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x564e

The only encouraging sign is that when I now run nvidia_gpu_tools.py with --recover-broken-gpu, no "broken" message appears.

Is it possible that the GPU firmware is locked or corrupted?

hiroki-chen commented 4 months ago

You can try resetting the GPU if it is stuck in a deadlock or an unrecoverable state.

echo 1 | sudo tee /sys/bus/pci/devices/0000:xx:00.0/reset
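If the function-level reset does not help, a PCI remove-and-rescan sometimes recovers the device (standard sysfs knobs, not an NVIDIA-specific procedure):

echo 1 | sudo tee /sys/bus/pci/devices/0000:xx:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan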

Hope my late response helps.

pjaroszynski commented 4 months ago

@hedj17 please post a full log of nvidia_gpu_tools.py with ... --set-cc-mode off --log debug
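For example, reusing the BDF from earlier in this thread (the tee just captures the output for attaching here):

sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=99:00.0 --set-cc-mode off --log debug 2>&1 | tee set-cc-off-debug.log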

hedj17 commented 4 months ago

I have solved this problem by reflashing the GPU firmware; the firmware version was too old.

sreimchangwalker commented 1 month ago

Could you please tell me how you refreshed the GPU firmware? I have the same problem.

Tan-YiFan commented 1 month ago

Please contact the vendor of your server (e.g., Dell or Supermicro, rather than NVIDIA) for a firmware update.