NVIDIA / nvtrust

Ancillary open source software to support confidential computing on NVIDIA GPUs
Apache License 2.0

Do we need to install the NVIDIA KMD on the host side to enable the HCC VM? #15

Closed JustPlay closed 8 months ago

Tan-YiFan commented 8 months ago

The responsibility of the host is to execute the script: host_tools/python/gpu_cc_tool.py. What problem did you encounter?

JustPlay commented 8 months ago

> The responsibility of the host is to execute the script: host_tools/python/gpu_cc_tool.py. What problem did you encounter?


root@p133-011-144:~/tdx/scripts/nvidia.d/nvtrust/host_tools/python# python3 gpu_cc_tool.py --query-cc-settings
NVIDIA GPU Tools version 535.86.06
File "gpu_cc_tool.py", line 127, in find_gpus_sysfs
dev = Gpu(dev_path=dev_path)
File "gpu_cc_tool.py", line 2055, in __init__
self.bar0 = self._map_bar(0)
File "gpu_cc_tool.py", line 1163, in _map_bar
return FileMap("/dev/mem", bar_addr, bar_size)
File "gpu_cc_tool.py", line 239, in __init__
mapped = mmap.mmap(f.fileno(), size, mmap.MAP_SHARED, prot, offset=offset)
2023-10-08,11:07:09.880 ERROR    GPU /sys/bus/pci/devices/0000:0f:00.0 broken: [Errno 1] Operation not permitted
2023-10-08,11:07:09.884 ERROR    Config space working True
Traceback (most recent call last):
File "gpu_cc_tool.py", line 127, in find_gpus_sysfs
dev = Gpu(dev_path=dev_path)
File "gpu_cc_tool.py", line 2055, in __init__
self.bar0 = self._map_bar(0)
File "gpu_cc_tool.py", line 1163, in _map_bar
return FileMap("/dev/mem", bar_addr, bar_size)
File "gpu_cc_tool.py", line 239, in __init__
mapped = mmap.mmap(f.fileno(), size, mmap.MAP_SHARED, prot, offset=offset)
PermissionError: [Errno 1] Operation not permitted

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "gpu_cc_tool.py", line 2499, in <module>
main()
File "gpu_cc_tool.py", line 2443, in main
gpus, other = find_gpus()
File "gpu_cc_tool.py", line 146, in find_gpus
return find_gpus_sysfs(bdf)
File "gpu_cc_tool.py", line 137, in find_gpus_sysfs
dev = BrokenGpu(dev_path=dev_path)
File "gpu_cc_tool.py", line 1269, in __init__
self.bars_configured = self.sanity_check_cfg_space_bars()
AttributeError: 'BrokenGpu' object has no attribute 'sanity_check_cfg_space_bars'

Tan-YiFan commented 8 months ago

The NVIDIA driver does not need to be installed on the host machine.

To debug the failed mmap, the following steps might help:

  1. Check dmesg for any related error messages.
  2. The error comes from calling mmap on /dev/mem, so searching for that failure should turn up answers. One possible answer: https://stackoverflow.com/questions/8213671/mmap-operation-not-permitted

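
For reference, the failing call can be reproduced outside gpu_cc_tool.py. The following is a minimal sketch, not the tool's own code: the helper name `try_mmap` is hypothetical, and it simply repeats the `mmap.mmap(...)` call from the traceback and reports which errno comes back. It assumes a Linux host and root privileges when pointed at /dev/mem.

```python
import errno
import mmap
import os


def try_mmap(path, offset, size):
    """Try to map `size` bytes of `path` starting at `offset`.

    Returns None on success, otherwise a short string naming the failure
    (for example "EPERM" for "[Errno 1] Operation not permitted").
    `try_mmap` is a hypothetical helper, not part of gpu_cc_tool.py.
    """
    try:
        fd = os.open(path, os.O_RDWR | os.O_SYNC)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        # Same argument order as the call shown in the traceback.
        mapped = mmap.mmap(fd, size, mmap.MAP_SHARED,
                           mmap.PROT_READ | mmap.PROT_WRITE, offset=offset)
        mapped.close()
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    except ValueError as exc:
        # Python raises ValueError itself, e.g. when the length exceeds
        # the size of a regular file.
        return f"ValueError: {exc}"
    finally:
        os.close(fd)
```

Calling `try_mmap("/dev/mem", bar_addr, bar_size)` with the BAR address and size of the GPU (values not shown in this thread) would be expected to return "EPERM" on a host that filters /dev/mem access.
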
JustPlay commented 8 months ago

> 2. https://stackoverflow.com/questions/8213671/mmap-operation-not-permitted

Yes, I found that it is caused by these options in the host kernel config:

CONFIG_STRICT_DEVMEM=y # Filter access to /dev/mem
CONFIG_IO_STRICT_DEVMEM=y # Filter I/O access to /dev/mem
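
A minimal sketch of checking those two options programmatically, assuming the kernel config is available as text. The function names are hypothetical, and the `/boot/config-<release>` location is a Debian/Ubuntu convention that may not hold on other distributions.

```python
import os

DEVMEM_OPTIONS = ("CONFIG_STRICT_DEVMEM", "CONFIG_IO_STRICT_DEVMEM")


def devmem_options(config_text):
    """Return the values of the /dev/mem filtering options in a kernel config.

    Options that are absent or commented out ("# CONFIG_X is not set")
    are reported as "n".
    """
    found = {name: "n" for name in DEVMEM_OPTIONS}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key in found:
            found[key] = value
    return found


def running_kernel_devmem_options():
    """Read the running kernel's config (assumed path) and report the options."""
    path = f"/boot/config-{os.uname().release}"  # assumption: Debian/Ubuntu layout
    with open(path) as f:
        return devmem_options(f.read())
```

On the host described above, `running_kernel_devmem_options()` would be expected to report "y" for both options.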

Tan-YiFan commented 8 months ago

Have you solved this problem? If not, it seems that changing mmio_access_type to sysfs may help.

JustPlay commented 8 months ago

> Have you solved this problem? If not, it seems that changing mmio_access_type to sysfs may help.

I tried mmio_access_type=sysfs, but it did not work:

root@p133-011-144:~/tdx/scripts/nvidia.d/nvtrust/host_tools/python# python3 gpu_cc_tool.py --query-cc-mode
NVIDIA GPU Tools version 535.104.12
file=/sys/bus/pci/devices/0000:0f:00.0/resource0, size=16777216, offset=0
  File "gpu_cc_tool.py", line 128, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "gpu_cc_tool.py", line 2059, in __init__
    self.bar0 = self._map_bar(0)
  File "gpu_cc_tool.py", line 1165, in _map_bar
    return FileMap(os.path.join(self.dev_path, f"resource{self._bar_num_to_sysfs_resource(bar_num)}"), 0, bar_size)
  File "gpu_cc_tool.py", line 241, in __init__
    mapped = mmap.mmap(f.fileno(), size, mmap.MAP_SHARED, prot, offset=offset)
2023-10-08,13:02:41.287 ERROR    GPU /sys/bus/pci/devices/0000:0f:00.0 broken: [Errno 22] Invalid argument
2023-10-08,13:02:41.291 ERROR    Config space working True

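I added iomem=relaxed to the host kernel command line, and gpu_cc_tool.py now seems to work in both devmem mode and sysfs mode. (I have not tested every step or verified this further, because the machine crashed and I am still trying to recover it.)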
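
A minimal sketch of confirming that the parameter is active after a reboot, assuming a Linux host where the boot parameters are exposed in /proc/cmdline. The function names are hypothetical.

```python
def has_iomem_relaxed(cmdline):
    """Return True if the kernel command line contains iomem=relaxed."""
    return "iomem=relaxed" in cmdline.split()


def running_kernel_has_iomem_relaxed():
    """Check the command line the running kernel was booted with."""
    with open("/proc/cmdline") as f:
        return has_iomem_relaxed(f.read())
```

The check matches the whole token, so a command line containing `iomem=strict` is not reported as relaxed.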

JustPlay commented 8 months ago

> Have you solved this problem? If not, it seems that changing mmio_access_type to sysfs may help.

Setting CONFIG_STRICT_DEVMEM=y together with CONFIG_IO_STRICT_DEVMEM=n might also work; see:

https://elixir.bootlin.com/linux/v6.5.5/source/lib/Kconfig.debug#L1838

Tan-YiFan commented 8 months ago

If you can recompile the host kernel, setting CONFIG_IO_STRICT_DEVMEM=n may help. I do not have access to an H800, but I was able to mmap a device address on my Ubuntu machine, whose kernel has CONFIG_IO_STRICT_DEVMEM=n.

rnertney commented 8 months ago

You shouldn't need the kernel parameter listed above. When you try to set the GPU mode, can you first show me the output of lspci -vd 10de:?
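
If lspci is not at hand, a rough equivalent of the device selection in `lspci -d 10de:` can be sketched by reading the `vendor` attribute that sysfs exposes for each PCI device. This is only a sketch under that assumption: the function name is hypothetical, and it lists matching bus addresses rather than the verbose details that `-v` prints.

```python
import os

NVIDIA_VENDOR_ID = 0x10DE  # the vendor ID selected by "lspci -d 10de:"


def find_devices_by_vendor(sysfs_root="/sys/bus/pci/devices",
                           vendor_id=NVIDIA_VENDOR_ID):
    """Return the sorted PCI addresses (BDFs) whose vendor ID matches."""
    matches = []
    if not os.path.isdir(sysfs_root):
        return matches
    for bdf in sorted(os.listdir(sysfs_root)):
        vendor_path = os.path.join(sysfs_root, bdf, "vendor")
        try:
            with open(vendor_path) as f:
                vendor = int(f.read().strip(), 16)  # file holds e.g. "0x10de"
        except (OSError, ValueError):
            continue  # skip entries without a readable vendor attribute
        if vendor == vendor_id:
            matches.append(bdf)
    return matches
```

On the host in this thread, the result would be expected to include 0000:0f:00.0, the device path shown in the error logs.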

rnertney commented 8 months ago

Closing due to inactivity. In short, you do not need to install the driver on the host to toggle CC modes. Please refer to our deployment guide for step-by-step instructions to configure your machine.