Closed: hedj17 closed this issue 4 months ago
The "running out of DMA mappings" problem was also encountered in https://github.com/NVIDIA/nvtrust/issues/46; however, that issue seems unsolved.
You should power off the VM before changing the CC state.
@Tan-YiFan Yes, I have powered off the VM, but I still can't change the CC state. Should I refresh the firmware of the GPU?
Did you try the other commands of nvidia_gpu_tools.py listed in https://github.com/NVIDIA/gpu-admin-tools?
If a GPU is in CC mode (on or devtools), the driver on the host is not able to communicate with it. Rebooting the host machine would then panic at modprobe (in this case, please blacklist the nvidia driver on the host via the kernel command line).
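As an illustration of the blacklist suggestion, a minimal sketch assuming an Ubuntu-style GRUB setup and the usual NVIDIA host module names (adjust for your distro):
# Append to GRUB_CMDLINE_LINUX in /etc/default/grub:
#   modprobe.blacklist=nvidia,nvidia_drm,nvidia_modeset,nvidia_uvm
sudo update-grub
sudo reboot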
@Tan-YiFan When I run other commands of nvidia_gpu_tools.py, I always get:
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu-bdf=99:00.0', '--reset-with-sbr']
Topo:
Intel root port 0000:98:01.0
GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000
2024-06-17,16:12:32.502 INFO Selected GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000
2024-06-17,16:12:32.502 WARNING GPU 0000:99:00.0 H100-PCIE 0x2331 BAR0 0xce042000000 has CC mode devtools, some functionality may not work
However, when I ran the command with "--recover-broken-gpu", I got this error:
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu-bdf=99:00.0', '--recover-broken-gpu']
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
dev = Gpu(dev_path=dev_path)
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3671, in __init__
raise BrokenGpuError()
2024-06-17,16:20:00.915 ERROR GPU /sys/bus/pci/devices/0000:99:00.0 broken:
2024-06-17,16:20:00.917 ERROR Config space working True
Topo:
Intel root port 0000:98:01.0
GPU 0000:99:00.0 ? 0x2331 BAR0 0xce042000000
GPU 0000:99:00.0 [broken, cfg space working 1 bars configured 1]
2024-06-17,16:20:00.917 INFO Selected GPU 0000:99:00.0 [broken, cfg space working 1 bars configured 1]
2024-06-17,16:20:01.350 ERROR Config space working True
2024-06-17,16:20:01.350 ERROR Failed to recover GPU 0000:99:00.0 [broken, cfg space working 1 bars configured 1]
I guessed it was related to modprobe and rebooting the host machine, but I have uninstalled the nvidia driver, and when I ran "lspci -d 10de: -k", I got:
99:00.0 3D controller: NVIDIA Corporation Device 2331 (rev a1)
Subsystem: NVIDIA Corporation Device 1626
Kernel modules: nvidiafb, nouveau
Should I still blacklist the nvidia driver, or could this be related to nouveau?
When I checked the kernel messages on the host machine, I got the following:
[ 36.839018] nouveau 0000:99:00.0: unknown chipset (180000a1)
[ 45.118109] audit: type=1400 audit(1718611588.572:63): apparmor="DENIED" operation="capable" class="cap" profile="/usr/sbin/cupsd" pid=1345 comm="cupsd" capability=12 capname="net_admin"
[ 45.943537] loop12: detected capacity change from 0 to 8
[ 105.148020] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 114.416830] audit: type=1400 audit(1718611657.771:64): apparmor="DENIED" operation="capable" class="cap" profile="/snap/snapd/21759/usr/lib/snapd/snap-confine" pid=1730 comm="snap-confine" capability=12 capname="net_admin"
[ 114.416851] audit: type=1400 audit(1718611657.771:65): apparmor="DENIED" operation="capable" class="cap" profile="/snap/snapd/21759/usr/lib/snapd/snap-confine" pid=1730 comm="snap-confine" capability=38 capname="perfmon"
Please try modprobe -r on all nvidia-related modules and bind this GPU to vfio.
Please also confirm that the VBIOS version meets the requirement.
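A rough sketch of that unbind-and-rebind flow, plus the usual way to read the VBIOS version once a working NVIDIA driver owns the GPU (the module list, the driver_override mechanism, and the 99:00.0 BDF are assumptions based on this thread, not official steps):
# Unload host GPU modules (names taken from the lspci -k output above; check lsmod)
sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia nouveau nvidiafb
# Unbind from any current driver (skip if none is bound), then bind to vfio-pci via driver_override
sudo modprobe vfio-pci
echo 0000:99:00.0 | sudo tee /sys/bus/pci/devices/0000:99:00.0/driver/unbind
echo vfio-pci | sudo tee /sys/bus/pci/devices/0000:99:00.0/driver_override
echo 0000:99:00.0 | sudo tee /sys/bus/pci/drivers_probe
# VBIOS version can be read only while the NVIDIA driver (not vfio-pci) is bound
nvidia-smi --query-gpu=vbios_version --format=csv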
@Tan-YiFan I have reinstalled the nvidia driver on the host, but I still get the following message on the host machine. How can I query the VBIOS version?
No devices were found
I have also tried modprobe -r on all nvidia-related modules for this GPU and bound it to vfio, but I still get the same failure:
Traceback (most recent call last):
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 5086, in main
gpu.set_cc_mode(opts.set_cc_mode)
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 4015, in set_cc_mode
self.fsp_rpc.prc_knob_check_and_write(PrcKnob.PRC_KNOB_ID_CCD.value, cc_dev_mode)
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3350, in prc_knob_check_and_write
self.prc_knob_write(knob_id, value)
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3341, in prc_knob_write
data = self.prc_cmd([prc, prc_1])
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3240, in prc_cmd
self.poll_for_msg_queue()
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3200, in poll_for_msg_queue
raise GpuError(f"Timed out polling for {self.falcon.name} message queue on channel {self.channel_num}. head {mhead} == tail {mtail}")
__main__.GpuError: Timed out polling for fsp message queue on channel 2. head 2048 == tail 2048
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 5289, in <module>
main()
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 5091, in main
prc_knobs = gpu.query_prc_knobs()
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 4049, in query_prc_knobs
knob_value = self.fsp_rpc.prc_knob_read(knob.value)
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3318, in prc_knob_read
data = self.prc_cmd([prc])
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3227, in prc_cmd
self.poll_for_queue_empty()
File "/shared/gpu-admin-tools/./nvidia_gpu_tools.py", line 3214, in poll_for_queue_empty
raise GpuError(f"Timed out polling for {self.falcon.name} cmd queue to be empty on channel {self.channel_num}. head {mhead} != tail {mtail}")
__main__.GpuError: Timed out polling for fsp cmd queue to be empty on channel 2. head 2048 != tail 2060
When I checked the kernel messages on the host machine, I got the following:
[ 447.195093] vfio-pci 0000:99:00.0: Enabling HDA controller
[ 505.280698] vfio-pci 0000:99:00.0: Enabling HDA controller
[ 505.519841] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x19@0x100
[ 505.519853] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x24@0x140
[ 505.519856] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x25@0x14c
[ 505.519858] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x26@0x158
[ 505.519860] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x2a@0x188
[ 505.519868] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x27@0x200
[ 505.519887] vfio-pci 0000:99:00.0: vfio_ecap_init: hiding ecap 0x2e@0x2c8
[ 505.529397] vfio-pci 0000:99:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x564e
The only encouraging sign is that when I run nvidia_gpu_tools.py with --recover-broken-gpu, the broken-GPU message no longer appears.
Is it possible that the firmware of the GPU has been locked or corrupted?
You can try resetting the GPU if it is stuck in a deadlock or an unrecoverable state:
echo 1 | sudo tee /sys/bus/pci/devices/0000:xx:00.0/reset
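If the sysfs reset alone does not help, a more heavy-handed generic PCI option sometimes used is to remove the device and rescan the bus (standard sysfs interfaces, not specific to this GPU or tool):
echo 1 | sudo tee /sys/bus/pci/devices/0000:xx:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan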
Hope my late response helps.
@hedj17 please post a full log of nvidia_gpu_tools.py with ... --set-cc-mode off --log debug
I have solved this problem by refreshing the firmware of the GPU. The problem was that the firmware version was too old.
Could you please tell me how to refresh the GPU firmware? I have the same problem.
Please contact the vendor of your server (e.g., Dell or SuperMicro, rather than NVIDIA) for a firmware update.
When I launch a CVM, I get the following warning:
qemu-system-x86_64: -device vfio-pci,host=99:00.0,bus=pci.1: warning: vfio_register_ram_discard_listener: possibly running out of DMA mappings. E.g., try increasing the 'block-size' of virtio-mem devies. Maximum possible DMA mappings: 1048576, Maximum possible memslots: 32764
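For context, the 'block-size' mentioned in the warning is a property of QEMU's virtio-mem device; a sketch of what a larger block size might look like on the QEMU command line (device and property names follow upstream QEMU; the sizes are placeholders, not a recommendation):
# Fewer, larger blocks mean fewer DMA mappings registered through vfio
qemu-system-x86_64 ... \
  -object memory-backend-ram,id=vmem0,size=64G \
  -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=32G,block-size=128M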
I also enabled persistence mode for the NVIDIA device via nvidia-persistenced, by changing /usr/lib/systemd/system/nvidia-persistenced.service to:
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --uvm-persistence-mode --verbose
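A unit change like this only takes effect after reloading systemd and restarting the service (standard systemd steps, shown as a sketch):
sudo systemctl daemon-reload
sudo systemctl restart nvidia-persistenced
systemctl status nvidia-persistenced   # confirm the new ExecStart line is in effect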
but when I run nvidia-smi, I get an error:
No devices were found
When I run dmesg, I get:
When I try to turn off H100 confidential computing in order to reinitialize it, I get this error:
How can I fix these problems?