canonical / tdx

Intel confidential computing - TDX
GNU General Public License v3.0

nvidia-smi: "No devices were found" #292

gongchangsui closed this issue 10 hours ago

gongchangsui commented 2 days ago

To pass the GPU through to the TDX VM, I modified guest-tools/run_td.sh and added -device vfio-pci,host=0000:85:00.0 to the QEMU command:

qemu-system-x86_64 -D $LOGFILE \
                   -accel kvm \
                   -m 32G -smp 48 \
                   -name ${PROCESS_NAME},process=${PROCESS_NAME},debug-threads=on \
                   -cpu host \
                   -object '{"qom-type":"tdx-guest","id":"tdx","quote-generation-socket":{"type": "vsock", "cid":"2","port":"4050"}}' \
                   -machine q35,kernel_irqchip=split,confidential-guest-support=tdx,hpet=off \
                   -bios ${TDVF_FIRMWARE} \
                   -nographic -daemonize \
                   -nodefaults \
                   -device virtio-net-pci,netdev=nic0_td -netdev user,id=nic0_td,hostfwd=tcp::${SSH_PORT}-:22 \
                   -drive file=${TD_IMG},if=none,id=virtio-disk0 \
                   -device virtio-blk-pci,drive=virtio-disk0 \
                   -device vfio-pci,host=0000:85:00.0 \
                   ${QUOTE_VSOCK_ARGS} \
                   -pidfile /tmp/tdx-demo-td-pid.pid
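
Before launching QEMU with the vfio-pci device above, it may be worth confirming on the host that 0000:85:00.0 is actually bound to the vfio-pci driver. The following is a minimal sketch, assuming the standard Linux sysfs layout. The function name pci_driver and the SYSFS_ROOT override are hypothetical helpers introduced here for illustration; they are not part of run_td.sh.

```shell
#!/bin/sh
# Print the name of the kernel driver bound to a PCI device, or "none".
# SYSFS_ROOT is a hypothetical override (defaults to /sys) so the function
# can be exercised without real hardware.
pci_driver() {
    dev_dir="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1"
    if [ -L "$dev_dir/driver" ]; then
        # The "driver" entry is a symlink to the bound driver's directory.
        basename "$(readlink "$dev_dir/driver")"
    else
        echo none
    fi
}
```

For example, `pci_driver 0000:85:00.0` run on the host should print vfio-pci if the binding is in place. Any other value suggests the device is still claimed by a host driver or is unbound.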

After the NVIDIA driver is installed, running nvidia-smi prints "No devices were found". The dmesg log:

[   52.832378] nvidia 0000:00:03.0: swiotlb buffer is full (sz: 524288 bytes), total 262144 (slots), used 1044 (slots)
[   52.832530] NVRM: 0000:00:03.0: Failed to create a DMA mapping!
[   52.846980] NVRM: GPU 0000:00:03.0: RmInitAdapter failed! (0x62:0x59:2535)
[   52.875007] NVRM: GPU 0000:00:03.0: rm_init_adapter failed, device minor number 0
[   52.894003] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000003] Failed to allocate NvKmsKapiDevice
[   52.919209] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000003] Failed to register device
[   53.953330] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[   53.981327] Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[   53.992975] Loaded X.509 cert 'wens: 61c038651aabdcf94bd0ac7ff06c7248db18c600'
[   54.172397] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[   54.381275] nvidia-uvm: Loaded the UVM driver, major device number 235.
[   55.541269] NET: Registered PF_QIPCRTR protocol family
[   57.908201] loop0: detected capacity change from 0 to 8
[   59.049163] kauditd_printk_skb: 111 callbacks suppressed
[   59.049239] audit: type=1400 audit(1733208919.027:123): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=1389 comm="apparmor_parser"
[   59.068164] audit: type=1400 audit(1733208919.046:124): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=1389 comm="apparmor_parser"
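
The first line of the log above carries the numbers needed to tell whether the bounce-buffer pool was really exhausted. The following is a rough sketch that converts them into slots and MiB. It assumes the kernel's usual 2 KiB swiotlb slot size (IO_TLB_SHIFT = 11); swiotlb_summary and SLOT_BYTES are hypothetical names introduced here.

```shell
#!/bin/sh
# Assumption: one swiotlb slot is 2 KiB (IO_TLB_SHIFT = 11 in the kernel).
SLOT_BYTES=2048

# Read "swiotlb buffer is full" dmesg lines on stdin and print, per line,
# the request size in slots, the pool size in MiB, and the slots in use.
swiotlb_summary() {
    sed -n 's/.*swiotlb buffer is full (sz: \([0-9]*\) bytes), total \([0-9]*\) (slots), used \([0-9]*\) (slots).*/\1 \2 \3/p' |
    while read -r sz total used; do
        # Round the request up to whole slots.
        echo "request_slots=$(( (sz + SLOT_BYTES - 1) / SLOT_BYTES ))"
        echo "pool_mib=$(( total * SLOT_BYTES / 1048576 ))"
        echo "used_slots=$used"
    done
}
```

Inside the guest this could be run as `dmesg | swiotlb_summary`. For the line above it gives a 256-slot (512 KiB) request against a 512 MiB pool with only 1044 slots in use, so the pool as a whole was far from full. Upstream kernels also cap a single bounce-buffer mapping at 128 contiguous slots (IO_TLB_SEGSIZE), which is 256 KiB with 2 KiB slots, and a 512 KiB request would exceed that. Whether that limit applies unchanged to this kernel build is an assumption.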


System report (output of the system-report.sh script, run on the host system)

Git ref

cd2f1fb329aa61355bdcf779b0f17f2c22f4d18a

Operating system details

Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble

Kernel version

6.8.0-1013-intel #20-Ubuntu SMP PREEMPT_DYNAMIC Thu Oct  3 17:38:00 UTC 2024 x86_64 x86_64 GNU/Linux

TDX kernel logs

[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-1013-intel root=/dev/mapper/ubuntu--vg-ubuntu--lv ro kvm_intel.tdx=1 nohibernate
[    1.215181] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-1013-intel root=/dev/mapper/ubuntu--vg-ubuntu--lv ro kvm_intel.tdx=1 nohibernate
[    2.369765] virt/tdx: BIOS enabled: private KeyID range [32, 64)
[    2.369769] virt/tdx: Disable ACPI S3. Turn off TDX in the BIOS to use ACPI S3.
[   20.532196] virt/tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 5, build_date 20231008, build_num 595
[   20.532204] virt/tdx: CMR: [0x100000, 0x77800000)
[   20.532208] virt/tdx: CMR: [0x100000000, 0x207a000000)
[   20.532209] virt/tdx: CMR: [0x2080000000, 0x407c000000)
[   20.532210] virt/tdx: CMR: [0x4080000000, 0x607c000000)
[   20.532212] virt/tdx: CMR: [0x6080000000, 0x807c000000)
...
[   20.532212] virt/tdx: CMR: [0x6080000000, 0x807c000000)
[   22.610086] virt/tdx: 2101268 KB allocated for PAMT
[   22.610097] virt/tdx: module initialized
[ 6026.592028] WARNING: CPU: 168 PID: 18067 at arch/x86/kvm/vmx/tdx.c:1669 tdx_sept_split_private_spt+0xf4/0x190 [kvm_intel]
[ 6026.592159] RIP: 0010:tdx_sept_split_private_spt+0xf4/0x190 [kvm_intel]
[ 6026.592218]  ? tdx_sept_split_private_spt+0xf4/0x190 [kvm_intel]
[ 6026.592255]  ? tdx_sept_split_private_spt+0xf4/0x190 [kvm_intel]
[ 6026.592267]  ? tdx_sept_split_private_spt+0xb5/0x190 [kvm_intel]
[ 6026.593089] WARNING: CPU: 168 PID: 18067 at arch/x86/kvm/vmx/tdx.c:280 __tdx_reclaim_page+0xc9/0xe0 [kvm_intel]
[ 6026.593179] RIP: 0010:__tdx_reclaim_page+0xc9/0xe0 [kvm_intel]
[ 6026.593215]  ? __tdx_reclaim_page+0xc9/0xe0 [kvm_intel]
[ 6026.593238]  ? __tdx_reclaim_page+0xc9/0xe0 [kvm_intel]
[ 6026.593252]  tdx_sept_drop_private_spte+0x26f/0x2f0 [kvm_intel]
[ 6026.593266]  tdx_sept_remove_private_spte+0x3f/0x50 [kvm_intel]
[ 6026.593889] WARNING: CPU: 191 PID: 18173 at arch/x86/kvm/vmx/tdx.c:275 __tdx_reclaim_page+0xac/0xe0 [kvm_intel]
[ 6026.593991] RIP: 0010:__tdx_reclaim_page+0xac/0xe0 [kvm_intel]
[ 6026.594031]  ? __tdx_reclaim_page+0xac/0xe0 [kvm_intel]
[ 6026.594059]  ? __tdx_reclaim_page+0xac/0xe0 [kvm_intel]
[ 6026.594073]  tdx_sept_drop_private_spte+0x26f/0x2f0 [kvm_intel]
[ 6026.594090]  tdx_sept_remove_private_spte+0x3f/0x50 [kvm_intel]

TDX CPU instruction support

CPU supports TDX according to /proc/cpuinfo

Model specific registers (MSRs)

MK_TME_ENABLED bit: 1 (expected value: 1)
SEAM_RR bit: 1 (expected value: 1)
NUM_TDX_PRIV_KEYS: 20
SGX_AND_MCHECK_STATUS: 0 (expected value: 0)
Production platform: Pre-production (expected value: Production)

CPU details

 INTEL(R) XEON(R) PLATINUM 8558P

QEMU package details

Status: Installed
Package: qemu-system-x86
Version: 2:8.2.2+ds-0ubuntu1.4+tdx1.0
APT-Sources: https://ppa.launchpadcontent.net/kobuk-team/tdx-release/ubuntu noble/main amd64 Packages

Libvirt package details

Status: Installed
Package: libvirt-clients
Version: 10.0.0-2ubuntu8.3+tdx1.2
APT-Sources: https://ppa.launchpadcontent.net/kobuk-team/tdx-release/ubuntu noble/main amd64 Packages

OVMF package details

Status: Installed
Package: ovmf
Version: 2024.02-3+tdx1.0
APT-Sources: https://ppa.launchpadcontent.net/kobuk-team/tdx-release/ubuntu noble/main amd64 Packages

sgx-dcap-pccs package details

Status: Not Installed

tdx-qgs package details

Status: Not Installed

sgx-ra-service package details

Status: Not Installed

sgx-pck-id-retrieval-tool package details

Status: Not Installed

QGSD service status

Unit qgsd.service could not be found.

PCCS service status

Unit pccs.service could not be found.

MPA registration logs (last 30 lines)

syncronize-issues-to-jira[bot] commented 2 days ago

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/PEK-1508.

This message was autogenerated

hector-cao commented 1 day ago

@gongchangsui Thanks for your feedback. Right now, our TDX solution does not support device pass-through.

gongchangsui commented 1 day ago

> @gongchangsui Thanks for your feedback. Right now, our TDX solution does not support device pass-through.

Thank you for your reply. Is there any other way to use a GPU with the TDX solution?

BFuhry commented 19 hours ago

Intel provides the necessary kernel patches (https://github.com/intel/tdx-linux/tree/device-passthrough) for test purposes. For NVIDIA GPUs, there is also a dedicated guide: https://docs.nvidia.com/cc-deployment-guide-tdx.pdf.

gongchangsui commented 10 hours ago

> Intel provides the necessary kernel patches (https://github.com/intel/tdx-linux/tree/device-passthrough) for test purposes. For NVIDIA GPUs, there is also a dedicated guide: https://docs.nvidia.com/cc-deployment-guide-tdx.pdf.

Thank you