NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Failed to load NVIDIA driver within CVM (TDX) #531

Open herozyg opened 1 year ago

herozyg commented 1 year ago

NVIDIA Open GPU Kernel Modules Version

535.54.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Ubuntu 22.04

Kernel Release

6.2

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

A10

Describe the bug

Installed the latest driver in a TDVM and failed to run "nvidia-smi"; log below:

(screenshot: nvidia-smi fails with a "swiotlb buffer is full" error)

Could you please give any advice? Thank you!

To Reproduce

GPU: A10. CPU: Intel CPU w/ TDX. Install the latest driver (535.54.03) in the TDVM, then run "nvidia-smi".

Bug Incidence

Always

nvidia-bug-report.log.gz

no.

More Info

No response

aritger commented 1 year ago

Is it possible to generate and attach an nvidia-bug-report.log.gz? Maybe you would need to run nvidia-bug-report.sh with --safe-mode. Or, maybe attach your kernel log? It would be nice to be able to copy&paste error messages, rather than transcribe from a screenshot.

The "swiotlb buffer is full" error sounds like the problem.

Can you double check that the NVIDIA proprietary driver at the same version (535.54.03) works fine in this configuration? I'd be surprised if the interaction with swiotlb differed between the open and closed kernel modules.

herozyg commented 1 year ago

FYI. Thanks @aritger for your reply. Attaching the report for your review: nvidia-bug-report.log.gz

gaochaointel commented 1 year ago

My understanding of this issue: swiotlb can currently allocate at most 256 KB of contiguous memory; this is a limitation in swiotlb. The driver requested memory over that limit (e.g., 1 MB), so the allocation failed and swiotlb reported "swiotlb buffer is full".
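
For reference, a minimal sketch of where the 256 KB figure comes from, using the constant values from include/linux/swiotlb.h in v6.2 (other kernel versions may define them differently):

/* Where swiotlb's per-mapping limit comes from (values from Linux v6.2). */
#include <stdio.h>

#define IO_TLB_SHIFT   11                  /* each swiotlb slot is 2 KB      */
#define IO_TLB_SIZE    (1 << IO_TLB_SHIFT)
#define IO_TLB_SEGSIZE 128                 /* max contiguous slots per map   */

int main(void)
{
    /* Largest single contiguous bounce-buffer mapping swiotlb can satisfy. */
    printf("max contiguous mapping: %d KB\n",
           (IO_TLB_SEGSIZE * IO_TLB_SIZE) / 1024);  /* 128 * 2 KB = 256 KB */
    return 0;
}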

We may need some driver changes to support TDX VMs. E.g., is it possible for the driver to switch to dma_alloc_coherent()/dma_free_coherent() to allocate DMA buffers, instead of using the dma_map_*/dma_unmap_* family?
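
To illustrate the suggestion, a rough sketch (not the NVIDIA driver's actual code) of the proposed allocation path: dma_alloc_coherent() hands back a buffer that the DMA core has already set up as shared with the host in a confidential VM, so it is not bounced through swiotlb on every transfer:

/* Hedged sketch: allocate a DMA buffer with dma_alloc_coherent() instead
 * of mapping an existing private buffer with dma_map_single(). */
#include <linux/dma-mapping.h>

static void *buf;
static dma_addr_t buf_dma;

static int alloc_dma_buffer(struct device *dev, size_t size)
{
    /* Coherent memory is marked shared/unencrypted by the DMA core in a
     * confidential VM, so the device can access it directly. */
    buf = dma_alloc_coherent(dev, size, &buf_dma, GFP_KERNEL);
    return buf ? 0 : -ENOMEM;
}

static void free_dma_buffer(struct device *dev, size_t size)
{
    dma_free_coherent(dev, size, buf, buf_dma);
}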

Tan-YiFan commented 1 year ago

Hi @herozyg ,

The "swiotlb buffer is full" error while sz / 4096 <= total - used might result from Linux not supporting > 512 KB (or 256 KB) of contiguous memory, as @gaochaointel said.

You can try increasing this limit: change the value from 128 to 1024 at https://elixir.bootlin.com/linux/v6.2/source/include/linux/swiotlb.h#L25, then recompile the guest kernel and boot. (A sketch of the change is shown after this comment.)

Furthermore, try adding swiotlb=131072,force to the qemu -append parameter to increase the size of swiotlb; 131072 slots × 2 KB per slot gives a 256 MB pool. (For example, -append "swiotlb=131072,force".)

Note: modifying the kernel or not using the default swiotlb size may hurt performance.
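
For concreteness, the suggested change is a one-line edit to the guest kernel (a sketch against v6.2; 1024 is one value to try, not a verified fix):

--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ (around line 25)
-#define IO_TLB_SEGSIZE	128
+#define IO_TLB_SEGSIZE	1024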

RodgerZhu commented 1 year ago

> Hi @herozyg ,
>
> The "swiotlb buffer is full" error while sz / 4096 <= total - used might result from Linux not supporting > 512 KB (or 256 KB) of contiguous memory, as @gaochaointel said.
>
> You can try increasing this limit: change the value from 128 to 1024 at https://elixir.bootlin.com/linux/v6.2/source/include/linux/swiotlb.h#L25, then recompile the guest kernel and boot.
>
> Furthermore, try adding swiotlb=131072,force to the qemu -append parameter to increase the size of swiotlb. (For example, -append "swiotlb=131072,force".)
>
> Note: modifying the kernel or not using the default swiotlb size may hurt performance.

Thanks Yifan. Actually, I tried to update 128 to 1024, but still got the same error.


Tan-YiFan commented 1 year ago

@RodgerZhu A check for TDX VMs was added in 535.98.

Updating to the latest version of the NVIDIA driver may help.

wdsun1008 commented 1 year ago

I think it would be useful to combine a CVM with non-CC GPUs. It may not be entirely safe, but it could be considered an option to make GPUs more widely usable. When I examined the code of the NVIDIA Open GPU Kernel Modules, I found that NVIDIA has implemented checks and processing for SEV, presumably decrypting the relevant memory. For example, in nv-vm.c, when unencrypted is set to true (which should be true inside SEV), all allocations go through dma_alloc_coherent, which should make the memory decrypted, and all maps go through nv_adjust_pgprot, which also makes the memory decrypted. But when I use a 3090 with AMD SEV, the data turns into ciphertext after GPU processing. When I use SNP, I encounter the error Unsupported exit-code 0x404 in #VC exception, which seems to occur when memory is set as shared and pvalidate is called, resulting in the memory being invalidated. I think decrypted memory shouldn't trigger a #VC exception. @Tan-YiFan Any suggestions?
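
For context on the mapping path mentioned above, a rough sketch (not the actual nv-vm.c code) of clearing the encryption bit in the page protection, analogous to what nv_adjust_pgprot is described as doing:

/* Sketch only: strip the C-bit from a mapping's page protection so the
 * pages are treated as shared/decrypted in an SEV guest. */
#include <linux/mm.h>
#include <linux/pgtable.h>

static pgprot_t adjust_prot_for_sev(pgprot_t prot)
{
    /* pgprot_decrypted() clears the memory-encryption mask on x86. */
    return pgprot_decrypted(prot);
}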

Tan-YiFan commented 1 year ago

@wdsun1008 I do not have access to CVM+GPU, so I cannot reproduce this problem. Two guesses:

  1. Whether a page is private or shared is controlled by the C-bit of the stage-1 page table. dma_alloc_coherent makes the kernel-mode VA shared. However, CUDA uses user-level VAs, and their page tables differ, so accessing the page in user mode might trigger #VC, because the C-bit in the user-level page table remains 1 by default.
  2. The user-level instruction might be a DMA request. DMA of private memory might cause #VC.

wdsun1008 commented 1 year ago

Thanks for your reply. Are there any functions to clear the user-space C-bit?


Tan-YiFan commented 1 year ago

@wdsun1008

I am trying to implement clearing the user-space C-bit. I did not find an existing interface.

You can try executing some simple user-space code to locate the problem (https://github.com/AMDESE/AMDSEV/issues/177#issuecomment-1709645996)
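
Something along these lines (a minimal sketch, not the code from the linked comment; error handling trimmed) is enough to see whether a plain cudaMemcpy round-trip returns the data intact or as ciphertext:

// roundtrip.cu: copy a known pattern to the GPU and back, then compare.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 256;
    int h[n], r[n], *d;
    for (int i = 0; i < n; i++) h[i] = i;

    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(r, d, n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) {
        if (r[i] != h[i]) {
            // Garbage here suggests the DMA path moved encrypted pages.
            printf("mismatch at %d: got %d\n", i, r[i]);
            return 1;
        }
    }
    printf("round-trip OK\n");
    return 0;
}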

wdsun1008 commented 1 year ago

> @wdsun1008
>
> I am trying to implement clearing the user-space C-bit. I did not find an existing interface.
>
> You can try executing some simple user-space code to locate the problem (AMDESE/AMDSEV#177 (comment))

Here are some of my simple tests on SEV (without SNP; with SNP most tests cause #VC 404):

  1. malloc CPU mem and cudaMalloc GPU mem, cudaMemcpy to device, cudaMemcpy to host: the printed value is ciphertext.
  2. cudaMallocManaged UVM mem, CUDA kernel function to process the mem: CUDA returns "an illegal memory access was encountered" with dmesg:
    nvidia 0000:05:00.0: swiotlb buffer is full (sz: 2097152 bytes), total 524288 (slots), used 564 (slots)
    [65000.453596] NVRM: GPU at PCI:0000:05:00: GPU-54ca9673-89a7-afd6-a37f-cb6c3c3f1f48
    [65000.453605] NVRM: Xid (PCI:0000:05:00): 31, pid=19132, name=a.out, Ch 00000006, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7fc0_a2000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

    (swiotlb was adjusted to 1024 MB)

Tan-YiFan commented 1 year ago

@wdsun1008 Test 1 shows that cudaMalloc and cudaMemcpy cause DMA from private memory. Test 2 shows that cudaMallocManaged makes use of swiotlb. I suggest adding some debug information in linux/kernel/dma/swiotlb.c to find out why the swiotlb buffer is full (a sketch follows below). Changing IO_TLB_SEGSIZE from 128 to 1024 works for me but fails for others. I believe fixing the swiotlb issue would let the UVM test pass.
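
One way to add that debug information (a sketch against v6.2; the surrounding context and helper names vary between kernel versions) is to log the requested slot count next to the existing "swiotlb buffer is full" warning in kernel/dma/swiotlb.c:

--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ (in swiotlb_tbl_map_single, where the allocation fails)
 		if (!(attrs & DMA_ATTR_NO_WARN))
 			dev_warn_ratelimited(dev,
 	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
 				 alloc_size, mem->nslabs, mem_used(mem));
+		/* extra debug: was the request larger than one segment? */
+		dev_warn_ratelimited(dev,
+			"swiotlb: need %lu contiguous slots, segment limit %d\n",
+			(unsigned long)nr_slots(alloc_size), IO_TLB_SEGSIZE);
 		return (phys_addr_t)DMA_MAPPING_ERROR;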

wdsun1008 commented 1 year ago

@Tan-YiFan dcu-patch Here is a patch for the Hygon DCU kernel which implements a user-space decrypt function. It is not referenced anywhere in the kernel code; maybe the function is meant to be called by a device driver to decrypt memory?

Tan-YiFan commented 1 year ago

@wdsun1008 In this patch, __set_memory_enc_dec_user is almost the same as __set_memory_enc_dec, except that the page table pointer is passed as a function parameter. This function should be called in kernel space because it modifies the page table.

It could be called by a device driver. User space can use ioctl to pass the user-space virtual address to this function.

wdsun1008 commented 1 year ago

@Tan-YiFan I tried using a simple ko to perform user-space memory decryption, but the GPU computation still returns ciphertext. Here is my test code:

// memko.c
#include <linux/ioctl.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mm.h>
#include <asm/set_memory.h>

#define IOCTL_MEM_DECRYPT _IOW('k', 1, unsigned long)

static long device_ioctl(struct file *file, unsigned int ioctl_num, unsigned long ioctl_param) 
{
    unsigned long user_addr;
    unsigned long user_size;

    switch (ioctl_num) {
        case IOCTL_MEM_DECRYPT:
            // Copy the address and size from user space
            if (copy_from_user(&user_addr, (unsigned long *)ioctl_param, sizeof(unsigned long)) != 0)
                return -EFAULT;

            if (copy_from_user(&user_size, (unsigned long *)(ioctl_param + sizeof(unsigned long)), sizeof(unsigned long)) != 0)
                return -EFAULT;

            printk(KERN_INFO "Received address: 0x%lx, size: %lu\n", user_addr, user_size);

            // Convert the size to number of pages
            unsigned long numberOfPages = user_size / PAGE_SIZE;
            if (user_size % PAGE_SIZE != 0) {
                ++numberOfPages;
            }

            // Obtain the current process's mm_struct
            struct mm_struct *mm = current->mm;

            int ret = set_memory_decrypted_userspace(user_addr, numberOfPages, mm);
            printk("decrypt %d\n", ret);
            return 0;
            break;

        default:
            return -ENOTTY;
    }

    return 0;
}

static struct file_operations fops = 
{
    .unlocked_ioctl = device_ioctl,
};

static int major;  /* saved so the exit handler can unregister correctly */

static int __init memko_init(void) 
{
    major = register_chrdev(0, "memko", &fops);
    if (major < 0) {
        printk("Registering the character device failed with %d\n", major);
        return major;
    }

    printk("The major number is %d.\n", major);
    return 0;
}

static void __exit memko_exit(void) 
{
    /* unregister with the major number from init; passing 0 here was a bug */
    unregister_chrdev(major, "memko");
}

module_init(memko_init);
module_exit(memko_exit);

MODULE_LICENSE("GPL");
// test.cu
#include <stdio.h> 
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/ioctl.h>
#define IOCTL_MEM_DECRYPT _IOW('k', 1, unsigned long)
__global__ void helloCUDA(int* a) 
{
     a[threadIdx.x] += 2;
}
void HANDLE_ERROR(cudaError_t cuda_error_code){
    if(cuda_error_code != cudaSuccess) 
        printf("[E] CUDA返回错误: %s\n", cudaGetErrorString(cuda_error_code));
}

int decryptm(int fd, unsigned long addr, unsigned long size) {
    unsigned long args[2];
    args[0] = addr;
    args[1] = size;
    int retval = ioctl(fd, IOCTL_MEM_DECRYPT, args);
    if (retval == -1)
    {
        perror("ioctl error\n");
        exit(-1);
    }
    return retval;  /* propagate the ioctl result; main() checks it */
}

int main() 
{ 
    int             *a, *dev_a;
    int             deviceId;
    int fd;
    int retval; 

    fd=open("/dev/memko", O_RDWR);  
    if(fd==-1)  
    {  
        perror("error open\n");  
        exit(-1);  
    }  
    printf("open /dev/memko successfully\n"); 

    HANDLE_ERROR(cudaGetDevice(&deviceId)); 

    a = (int*)malloc(10 * sizeof(*a));    
    retval = decryptm(fd, (unsigned long)a, 10 * sizeof(*a));
    if (retval != 0) {
        perror("error decrypt\n");  
        exit(-1); 
    }
    for(int i = 0; i < 10; i++) {
        a[i] = i;
        printf("%d,", a[i]);
    }

    HANDLE_ERROR(cudaMalloc((void**)&dev_a,
        10 * sizeof(*dev_a)));

    HANDLE_ERROR(cudaMemcpy(dev_a, a,
            10 * sizeof(*dev_a),
            cudaMemcpyHostToDevice));
    helloCUDA<<<1, 10>>>(dev_a);

    HANDLE_ERROR(cudaDeviceSynchronize());

    HANDLE_ERROR(cudaMemcpy(a, dev_a,
            10 * sizeof(*dev_a),
            cudaMemcpyDeviceToHost));

    for(int i = 0; i < 10; i++) {
        printf("%d,", a[i]);
    }
    return 0;
}
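
(Presumably the module is built with kbuild and loaded with insmod, the device node is created with mknod /dev/memko c <major> 0 using the major number printed at load time, and test.cu is compiled with nvcc; the comment does not include the build steps.)
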
Tan-YiFan commented 1 year ago

@wdsun1008 I am trying to check whether the host hypervisor (kvm) could get the plain text of user-space data in CVM. Thanks for your code.

Tan-YiFan commented 11 months ago

@wdsun1008 I am sorry I did not manage to test it successfully. You can refer to https://github.com/AMDESE/AMDSEV/issues/185, which is a similar issue and has been handled.

wdsun1008 commented 11 months ago

> @wdsun1008 I am sorry I did not manage to test it successfully. You can refer to AMDESE/AMDSEV#185, which is a similar issue and has been handled.

No worries, I haven't been successful either; it seems we might need to rely on the future Trusted Device/TEE-IO solution. I have taken note of that issue and will continue to monitor any progress related to it. If there are any updates, I will share them through the relevant issue.

arronwy commented 10 months ago

We upgraded to the latest NVIDIA driver:

GPU: A10. CPU: Intel CPU w/ TDX. Installed the latest driver (535.129.03) in the TDVM.

lspci
02:00.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)

lsmod
Module                  Size  Used by
nvidia_modeset       1282048  0
nvidia_drm             16384  0
nvidia_uvm           1396736  0
nvidia              56565760  2 nvidia_uvm,nvidia_modeset

dmesg
[   60.988565] nvidia: loading out-of-tree module taints kernel.
[   60.988575] nvidia: module license 'NVIDIA' taints kernel.
[   60.988576] Disabling lock debugging due to kernel taint
[   61.195354] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   61.218186] nvidia-nvlink: Nvlink Core is being initialized, major device number 245

[   61.219818] ACPI: \_SB_.GSIG: Enabled at IRQ 22
[   61.219984] nvidia 0000:02:00.0: enabling device (0140 -> 0142)
[   62.083636] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.129.03  Thu Oct 19 18:56:32 UTC 2023
[   62.135673] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[   62.139783] nvidia-uvm: Loaded the UVM driver, major device number 243.
[   62.176665] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.129.03  Thu Oct 19 18:42:12 UTC 2023

Run cmd "nvidia-smi"

No devices were found
dmesg
[   62.176665] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.129.03  Thu Oct 19 18:42:12 UTC 2023
[  162.297760] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2
[  162.310423] ACPI Warning: \_SB.PCI0.S30.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20221020/nsarguments-61)
[  167.286673] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x25:0x65:1470)
[  167.287591] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
[  167.303284] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2
[  172.022455] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x25:0x65:1470)
[  172.023363] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0

@Tan-YiFan any suggestions?

Tan-YiFan commented 10 months ago

@arronwy Here is some of the information acquired from your log:

[ 162.297760] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2

This firmware should be stored at /usr/lib/firmware/nvidia.

[ 167.286673] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x25:0x65:1470)

0x25 => RM_INIT_GPU_LOAD_FAILED, 0x65 => NV_ERR_TIMEOUT

Was the driver installed by NVIDIA-Linux-x86_64-535.129.03.run (or the cuda installer) without the -m=kernel-open parameter? If so, I suggest installing the driver either with sh NVIDIA-Linux-x86_64-535.129.03.run -m=kernel-open, or by cloning this repo, checking out the version, and running make modules -j $(nproc) followed by make modules_install.

arronwy commented 10 months ago

Thanks @Tan-YiFan , I rebuilt the kernel module with the -m=kernel-open parameter as you mentioned and made sure the firmware gsp_ga10x.bin exists in the guest OS, but it seems the firmware still cannot be found, and there is a new error message:

ls -alh /usr/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin
-r--r--r-- 1 root root 37M Nov 17 07:40 /usr/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin

md5sum /usr/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin
baca3ef5eba805553186c9322c172fa1  /usr/lib/firmware/nvidia/535.129.03/gsp_ga10x.bin
[   42.842148] nvidia-uvm: Loaded the UVM driver, major device number 243.
[   43.722516] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  535.129.03  Release Build  (dvs-builder@U16-I3-B15-1-1)  Thu Oct 19 18:46:10 UTC 2023
[   59.137950] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2
[   59.137957] NVRM RmFetchGspRmImages: No firmware image found
[   59.137961] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x61:0x56:1594)
[   59.138755] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0

My driver build command:

./NVIDIA-Linux-x86_64-535.129.03.run -x && cd NVIDIA-Linux-x86_64-535.129.03
./nvidia-installer -a -q --ui=none \
 --no-cc-version-check \
 --no-opengl-files --no-install-libglvnd \
 -m=kernel-open \
 --kernel-source-path=

Tan-YiFan commented 10 months ago

@arronwy According to this line of log:

[ 59.137950] nvidia 0000:02:00.0: Direct firmware load for nvidia/535.129.03/gsp_ga10x.bin failed with error -2

It is at https://elixir.bootlin.com/linux/v6.6/source/drivers/base/firmware_loader/main.c#L905. The return value -2 is -ENOENT (#define ENOENT 2 /* No such file or directory */).

I suggest checking the kernel's firmware search path: the loader looks under /lib/firmware by default, so make sure nvidia/535.129.03/gsp_ga10x.bin is present there.

arronwy commented 10 months ago

Thanks @Tan-YiFan , I changed the firmware path to "/lib/firmware" and nvidia-smi works. deviceQuery also passes, but running other sample CUDA apps gives an error:

nvidia-smi
Fri Nov 17 08:12:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     Off | 00000000:02:00.0 Off |                    0 |
|  0%   52C    P0              59W / 150W |      4MiB / 23028MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA A10"
  CUDA Driver Version / Runtime Version          12.2 / 12.2
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 22516 MBytes (23609475072 bytes)
  (072) Multiprocessors, (128) CUDA Cores/MP:    9216 CUDA Cores
  GPU Max Clock rate:                            1695 MHz (1.70 GHz)
  Memory Clock rate:                             6251 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.2, NumDevs = 1
Result = PASS

./bf16TensorCoreGemm
CUDA error at ../../../Common/helper_cuda.h:888 code=801(cudaErrorNotSupported) "cudaSetDevice(devID)"
Initializing...

[  160.280094] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326
[  160.280099] NVRM sysmemConstruct_IMPL: *** Cannot allocate sysmem through fb heap
[  174.601818] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326

Tan-YiFan commented 10 months ago

@arronwy The error flag is NV_ERR_NOT_SUPPORTED, but I could not find which line of code sets this flag.

Below is my guess:

The Cannot allocate sysmem through fb heap message is at https://github.com/NVIDIA/open-gpu-kernel-modules/blob/535.129.03/src/nvidia/src/kernel/mem_mgr/system_mem.c#L226; nearby (at line 212) is code related to CVM:

    if ((sysGetStaticConfig(SYS_GET_INSTANCE()))->bOsCCEnabled &&
        gpuIsCCorApmFeatureEnabled(pGpu) &&
        FLD_TEST_DRF(OS32, _ATTR2, _MEMORY_PROTECTION, _UNPROTECTED,
                     pAllocData->attr2))
        {
            memdescSetFlag(pMemDesc, MEMDESC_FLAGS_ALLOC_IN_UNPROTECTED_MEMORY,
                           NV_TRUE);
        }

To debug this issue, I would try hacking into the NVIDIA kernel module:

  1. Which line of code sets the error flag NV_ERR_NOT_SUPPORTED?
  2. Is it related to the flag MEMDESC_FLAGS_ALLOC_IN_UNPROTECTED_MEMORY?

What's more, using version 535.129.03 is not recommended. See https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html#known-issues (search for "confidential"). NVIDIA suggests 535.104.05.

arronwy commented 10 months ago

Thanks @Tan-YiFan , I tried 535.104.05 and it seems to hit the same error:

nvidia-smi
Fri Nov 17 09:36:51 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     Off | 00000000:02:00.0 Off |                    0 |
|  0%   51C    P0              56W / 150W |      4MiB / 23028MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

./bf16TensorCoreGemm
CUDA error at ../../../Common/helper_cuda.h:888 code=801(cudaErrorNotSupported) "cudaSetDevice(devID)"
Initializing...

dmesg
[   36.591847] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   36.595762] nvidia-nvlink: Nvlink Core is being initialized, major device number 245

[   36.596639] ACPI: \_SB_.GSIG: Enabled at IRQ 22
[   36.596801] nvidia 0000:02:00.0: enabling device (0140 -> 0142)
[   37.449941] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.104.05  Release Build  (dvs-builder@U16-I2-C04-35-2)  Sat Aug 19 01:13:27 UTC 2023
[   37.499394] nvidia-uvm: Loaded the UVM driver, major device number 243.
[   37.562121] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  535.104.05  Release Build  (dvs-builder@U16-I2-C04-35-2)  Sat Aug 19 01:03:29 UTC 2023
[   49.331252] ACPI Warning: \_SB.PCI0.S30.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20221020/nsarguments-61)
[  117.335047] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326
[  117.335051] NVRM sysmemConstruct_IMPL: *** Cannot allocate sysmem through fb heap
[  143.503077] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326
[  143.503081] NVRM sysmemConstruct_IMPL: *** Cannot allocate sysmem through fb heap

arronwy commented 10 months ago

Hi @Tan-YiFan , I rebuilt driver version 535.104.05 with the patch below:

git diff
diff --git a/src/nvidia/src/kernel/mem_mgr/system_mem.c b/src/nvidia/src/kernel/mem_mgr/system_mem.c
index 250dc400c8a0..6e67422bdf7e 100644
--- a/src/nvidia/src/kernel/mem_mgr/system_mem.c
+++ b/src/nvidia/src/kernel/mem_mgr/system_mem.c
@@ -209,14 +209,8 @@ sysmemConstruct_IMPL

     memdescSetFlag(pMemDesc, MEMDESC_FLAGS_SYSMEM_OWNED_BY_CLIENT, NV_TRUE);

-    if ((sysGetStaticConfig(SYS_GET_INSTANCE()))->bOsCCEnabled &&
-        gpuIsCCorApmFeatureEnabled(pGpu) &&
-        FLD_TEST_DRF(OS32, _ATTR2, _MEMORY_PROTECTION, _UNPROTECTED,
-                     pAllocData->attr2))
-        {
-            memdescSetFlag(pMemDesc, MEMDESC_FLAGS_ALLOC_IN_UNPROTECTED_MEMORY,
+    memdescSetFlag(pMemDesc, MEMDESC_FLAGS_ALLOC_IN_UNPROTECTED_MEMORY,
                            NV_TRUE);
-        }

     memdescSetGpuCacheAttrib(pMemDesc, gpuCacheAttrib);

@@ -224,7 +218,7 @@ sysmemConstruct_IMPL
     if (rmStatus != NV_OK)
     {
         NV_PRINTF(LEVEL_ERROR,
-                  "*** Cannot allocate sysmem through fb heap\n");
+                  "*** Cannot allocate sysmem through fb heap3\n");
         memdescFree(pMemDesc);
         memdescDestroy(pMemDesc);
         goto failed;

and still get this error:

[ 1117.665051] nvidia-modeset: Unloading
[ 1117.668337] nvidia-uvm: Unloaded the UVM driver.
[ 1117.670403] nvidia-nvlink: Unregistered Nvlink Core, major device number 245
[ 1637.798500] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[ 1637.798507] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.104.05  Release Build  (root@localhost)  Fri Nov 17 11:12:31 UTC 2023
[ 1637.901119] nvidia-uvm: Loaded the UVM driver, major device number 243.
[ 1638.591124] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  535.104.05  Release Build  (root@localhost)  Fri Nov 17 11:08:44 UTC 2023
[ 1682.309982] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1326
[ 1682.309985] NVRM sysmemConstruct_IMPL: *** Cannot allocate sysmem through fb heap3

Tan-YiFan commented 10 months ago

@arronwy I'm sorry but I could not solve this problem. I do not have access to TDX machines so I could not reproduce this problem. Checking the source of NV_ERR_NOT_SUPPORTED might help.

arronwy commented 10 months ago

> @arronwy I'm sorry but I could not solve this problem. I do not have access to TDX machines so I could not reproduce this problem. Checking the source of NV_ERR_NOT_SUPPORTED might help.

Thanks @Tan-YiFan ,

I added the debug info below:

diff --git a/src/nvidia/arch/nvalloc/unix/src/os.c b/src/nvidia/arch/nvalloc/unix/src/os.c
index bb03eac64e06..94ad9e4f3e08 100644
--- a/src/nvidia/arch/nvalloc/unix/src/os.c
+++ b/src/nvidia/arch/nvalloc/unix/src/os.c
@@ -923,6 +923,7 @@ NV_STATUS osAllocPagesInternal(
             memdescGetGuestId(pMemDesc),
             memdescGetPteArray(pMemDesc, AT_CPU),
             &pMemData);
+            NV_PRINTF(LEVEL_ERROR, "%s: osAllocPagesInternal MEMDESC_FLAGS_GUEST_ALLOCATED %d\n", __FUNCTION__, status);
     }
     else
     {
@@ -962,6 +963,8 @@ NV_STATUS osAllocPagesInternal(
                 nodeId,
                 memdescGetPteArray(pMemDesc, AT_CPU),
                 &pMemData);
+
+            NV_PRINTF(LEVEL_ERROR, "%s: osAllocPagesInternal unencrypted %d\n", __FUNCTION__, status);
         }

         if (nv && nv->force_dma32_alloc)

And dmesg shows:

dmesg|grep osAllocPagesInternal
[ 4368.348385] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.348583] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.361951] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.362486] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.362968] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.363449] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.363608] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.363779] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.363884] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.462940] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.463516] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.464192] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.464584] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4368.464840] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4369.394464] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.180462] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.180967] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.199681] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.213703] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.217923] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.218577] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.218818] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.226344] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.228031] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.228280] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.342967] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.347899] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.348141] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.385904] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.388129] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.417296] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.430932] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.845001] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.859165] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.859421] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.860246] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.861170] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.864404] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.867470] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.867713] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.868613] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.871809] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.895995] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.896265] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 0
[ 4371.897869] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 86

Any suggestions?

Tan-YiFan commented 10 months ago

@arronwy osAllocPagesInternal calls nv_alloc_pages (in kernel-open/nvidia/nv.c); debugging further into it might help.

arronwy commented 9 months ago

I added the debug info below:

diff --git a/kernel-open/nvidia/nv-mmap.c b/kernel-open/nvidia/nv-mmap.c
index 152b22add538..c93ac357371f 100644
--- a/kernel-open/nvidia/nv-mmap.c
+++ b/kernel-open/nvidia/nv-mmap.c
@@ -347,6 +347,10 @@ int nv_encode_caching(
              * translates to the effective memory type WC if a WC MTRR
              * exists or else UC.
              */
+
+                nv_printf(NV_DBG_ERRORS,
+                    "NVRM: VM: memory type %d WC support is unavailable!\n",
+                    memory_type);
             return 1;
 #endif
         case NV_MEMORY_CACHED:

And dmesg shows:

[ 1442.105340] NVRM _memdescAllocInternal: ADDR_SYSMEM begin.
[ 1442.105341] NVRM: VM: nv_alloc_pages: 12288 pages, nodeid -1
[ 1442.105463] NVRM: VM:    contig 0  cache_type 2
[ 1442.105464] NVRM: VM: memory type 0 WC support is unavailable!
[ 1442.105465] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 86
[ 1442.105466] NVRM _memdescAllocInternal: _osAllocPages failed .
[ 1442.105467] NVRM _memdescAllocInternal: Done status 86 .
[ 1442.105469] NVRM nvCheckOkFailedNoLog: Check failed: Call not supported [NV_ERR_NOT_SUPPORTED] (0x00000056) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1329
[ 1442.105470] NVRM sysmemConstruct_IMPL: *** Cannot allocate sysmem through fb heap

seems "WC support is unavailable"?

Tan-YiFan commented 9 months ago

@arronwy NV_ALLOW_WRITE_COMBINING(memory_type = 0) returns false, so nv_pat_mode equals NV_PAT_MODE_DISABLED. See nv_determine_pat_mode (https://github.com/NVIDIA/open-gpu-kernel-modules/blob/535.104.05/kernel-open/nvidia/nv-pat.c#L349).

This might be caused by:

  • (Unlikely) The guest kernel is built without flag CONFIG_X86_PAT.
  • Some issue with IA32_PAT msr virtualization.

arronwy commented 9 months ago

> @arronwy NV_ALLOW_WRITE_COMBINING(memory_type = 0) returns false, so nv_pat_mode equals NV_PAT_MODE_DISABLED. See nv_determine_pat_mode (https://github.com/NVIDIA/open-gpu-kernel-modules/blob/535.104.05/kernel-open/nvidia/nv-pat.c#L349).
>
> This might be caused by:
>
>   • (Unlikely) The guest kernel is built without flag CONFIG_X86_PAT.
>   • Some issue with IA32_PAT msr virtualization.

Thanks @Tan-YiFan . Yes, I found CONFIG_X86_PAT is enabled in the guest kernel config, so NV_ENABLE_BUILTIN_PAT_SUPPORT is 0:

#if defined(CONFIG_X86_PAT)
#define NV_ENABLE_BUILTIN_PAT_SUPPORT 0
#else
#define NV_ENABLE_BUILTIN_PAT_SUPPORT 1
#endif

I added the debug info below:

diff --git a/kernel-open/nvidia/nv-pat.c b/kernel-open/nvidia/nv-pat.c
index 1fa530d9cce6..a3eacdd15ee1 100644
--- a/kernel-open/nvidia/nv-pat.c
+++ b/kernel-open/nvidia/nv-pat.c
@@ -351,6 +351,8 @@ static int nv_determine_pat_mode(void)
     unsigned int pat1, pat2, i;
     NvU8 PAT_WC_index;

+    nv_printf(NV_DBG_ERRORS,
+        "NVRM: nv_determine_pat_mode nv_pat_mode: %d.\n", nv_pat_mode);
     if (!test_bit(X86_FEATURE_PAT,
             (volatile unsigned long *)&boot_cpu_data.x86_capability))
     {
@@ -365,6 +367,8 @@ static int nv_determine_pat_mode(void)
         }
     }

+    nv_printf(NV_DBG_ERRORS,
+        "NVRM: nv_determine_pat_mode CPU support the PAT: %d.\n", nv_pat_mode);
     NV_READ_PAT_ENTRIES(pat1, pat2);
     PAT_WC_index = 0xf;

@@ -383,6 +387,8 @@ static int nv_determine_pat_mode(void)
         }
     }

+    nv_printf(NV_DBG_ERRORS,
+        "NVRM: nv_determine_pat_mode PAT_WC_index: %d.\n", PAT_WC_index);
     if (PAT_WC_index == 1)
     {
         return NV_PAT_MODE_KERNEL;
@@ -452,6 +458,9 @@ int nv_init_pat_support(nvidia_stack_t *sp)
     if (!disable_pat)
     {
         nv_enable_pat_support();
+
+        nv_printf(NV_DBG_ERRORS,
+            "NVRM: nv_init_pat_support %d.\n", nv_pat_mode);
         if (nv_pat_mode == NV_PAT_MODE_BUILTIN)
         {
              ret = nvidia_register_cpu_hotplug_notifier();

and dmesg shows:

[ 1500.343954] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[ 1500.343960] NVRM: nv_determine_pat_mode nv_pat_mode: 0.
[ 1500.344786] NVRM: nv_determine_pat_mode CPU support the PAT: 0.
[ 1500.344787] NVRM: nv_determine_pat_mode PAT_WC_index: 15.
[ 1500.344788] NVRM: nv_init_pat_support 0.
arronwy commented 9 months ago

#define NV_READ_PAT_ENTRIES(pat1, pat2)   rdmsr(0x277, (pat1), (pat2))

rdmsr 0x277
7040600070406

Tan-YiFan commented 9 months ago

@arronwy Does executing rdmsr 0x277 on the host get a byte "01"?

arronwy commented 9 months ago

> @arronwy Does executing rdmsr 0x277 on the host get a byte "01"?

On the host and in a non-TEE VM:

rdmsr 0x277
407050600070106
Tan-YiFan commented 9 months ago

@arronwy The second byte of 407050600070106 is "01" (WC). Maybe the TDX guest, host software, or hardware disables the write-combining PAT entry?
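
To make the byte-by-byte reading concrete, a small sketch that decodes an IA32_PAT (MSR 0x277) value using the standard encodings (00=UC, 01=WC, 04=WT, 05=WP, 06=WB, 07=UC-); the two constants below are the values reported earlier in this thread:

/* Decode the eight PAT entries of an IA32_PAT value; entry i is byte i.
 * The driver needs a 0x01 (WC) entry somewhere to find a write-combining
 * PAT index. */
#include <stdio.h>
#include <stdint.h>

static const char *pat_name(uint8_t e)
{
    switch (e) {
    case 0x00: return "UC";  case 0x01: return "WC";
    case 0x04: return "WT";  case 0x05: return "WP";
    case 0x06: return "WB";  case 0x07: return "UC-";
    default:   return "??";
    }
}

int main(void)
{
    uint64_t host  = 0x0407050600070106ULL; /* host: entry 1 = 01 (WC) */
    uint64_t guest = 0x0007040600070406ULL; /* TDX guest: no WC entry  */
    for (int i = 0; i < 8; i++)
        printf("entry %d: host=%-3s guest=%s\n", i,
               pat_name((host  >> (8 * i)) & 0xff),
               pat_name((guest >> (8 * i)) & 0xff));
    return 0;
}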

arronwy commented 9 months ago

> @arronwy The second byte of 407050600070106 is "01" (WC). Maybe the TDX guest, host software, or hardware disables the write-combining PAT entry?

Hi @Tan-YiFan , we found that the same TDX guest environment works for an NVIDIA H100. For the A10, the failure is because NV_MEMORY_WRITECOMBINED is not supported in the TEE guest; with the H100, the driver does not allocate this type of memory. The NV kernel driver has many places that check whether the card supports APM/HCC; do those checks avoid using NV_MEMORY_WRITECOMBINED? Can we avoid using this type of memory in the kernel driver for the A10 too?

Tan-YiFan commented 9 months ago

Hi @arronwy , I suggest adding os_dump_stack() or dump_stack() at the failing code to get the caller of the NV_MEMORY_WRITECOMBINED request.

arronwy commented 9 months ago

os_dump_stack()

[  397.616510] CPU: 2 PID: 1754 Comm: deviceQuery Tainted: G           OE     Y 6.2.16-nvidia-gpu-tdx #16
[  397.616513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[  397.616514] Call Trace:
[  397.616518]  <TASK>
[  397.616519]  dump_stack_lvl+0x33/0x50
[  397.616529]  nv_encode_caching+0x144/0x170 [nvidia]
[  397.616627]  nvidia_mmap_helper+0x4b5/0x7d0 [nvidia]
[  397.616679]  nvidia_mmap+0x4c/0x80 [nvidia]
[  397.616732]  mmap_region+0x24e/0x860
[  397.616738]  do_mmap+0x3cb/0x5f0
[  397.616741]  vm_mmap_pgoff+0xc4/0x100
[  397.616745]  ksys_mmap_pgoff+0x182/0x1f0
[  397.616748]  do_syscall_64+0x40/0x90
[  397.616752]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  397.616756] RIP: 0033:0x7fb29883ac17
[  397.616758] Code: 00 00 00 89 ef e8 59 ae ff ff eb e4 e8 42 7b 01 00 66 90 f3 0f 1e fa 41 89 ca 41 f7 c1 ff 0f 00 00 75 10 b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 21 c3 48 8b 05 e9 a1 0f 00 64 c7 00 16 00 00
[  397.616760] RSP: 002b:00007ffc3d8e0758 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
[  397.616762] RAX: ffffffffffffffda RBX: 0000700000200000 RCX: 00007fb29883ac17
[  397.616763] RDX: 0000000000000003 RSI: 0000000000200000 RDI: 0000000200200000
[  397.616764] RBP: 00007ffc3d8e07c0 R08: 000000000000000c R09: 0000000000000000
[  397.616765] R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000200000
[  397.616766] R13: 0000000001c15390 R14: 000000000000000c R15: 0000000200200000
[  397.616767]  </TASK>
[  397.616768] NVRM: VM: memory type 2  cache_type 2 WC support is unavailable!
[  397.616769] NVRM: VM: memory type 2 cache_type 7 nv_encode_caching passed!
[  397.616962] CPU: 2 PID: 1754 Comm: deviceQuery Tainted: G           OE     Y 6.2.16-nvidia-gpu-tdx #16
[  397.616964] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
[  397.616965] Call Trace:
[  397.616965]  <TASK>
[  397.616966]  dump_stack_lvl+0x33/0x50
[  397.616969]  nv_encode_caching+0x144/0x170 [nvidia]
[  397.617018]  nv_alloc_pages+0x6d/0x1b0 [nvidia]
[  397.617060]  osAllocPagesInternal+0x2e9/0x3f0 [nvidia]
[  397.617163]  ? osAllocPagesInternal+0x286/0x3f0 [nvidia]
[  397.617259]  ? memdescAlloc+0x177/0xe80 [nvidia]
[  397.617369]  ? memUtilsAllocMemDesc+0x48/0xe0 [nvidia]
[  397.617482]  ? sysmemAllocResources+0xdb/0x340 [nvidia]
[  397.617563]  ? sysmemAllocResources+0x109/0x340 [nvidia]
[  397.617639]  ? memdescGetPhysAddrsForGpu+0x40/0x130 [nvidia]
[  397.617740]  ? sysmemConstruct_IMPL+0x2eb/0x780 [nvidia]
[  397.617817]  ? __nvoc_objCreate_SystemMemory+0xc2/0x160 [nvidia]
[  397.617900]  ? __nvoc_objCreateDynamic+0x49/0x70 [nvidia]
[  397.617954]  ? os_alloc_mem+0xb0/0xc0 [nvidia]
[  397.618004]  ? os_alloc_mem+0xb0/0xc0 [nvidia]
[  397.618053]  ? _portMemAllocatorAlloc.part.0+0x1f/0x140 [nvidia]
[  397.618106]  ? resservResourceFactory+0x92/0x160 [nvidia]
[  397.618180]  ? _clientAllocResourceHelper+0x298/0x5c0 [nvidia]
[  397.618230]  ? serverResLock_Prologue+0x144/0x260 [nvidia]
[  397.618301]  ? _tlsThreadEntryGet+0x82/0x90 [nvidia]
[  397.618350]  ? serverAllocResourceUnderLock+0x294/0x9b0 [nvidia]
[  397.618430]  ? os_alloc_mem+0xb0/0xc0 [nvidia]
[  397.618478]  ? _serverLockClient+0x57/0x110 [nvidia]
[  397.618527]  ? _serverLockClientWithLockInfo.constprop.0+0x9a/0x1c0 [nvidia]
[  397.618576]  ? serverAllocResource+0x284/0x480 [nvidia]
[  397.618625]  ? rmapiAllocWithSecInfo+0x1cf/0x380 [nvidia]
[  397.618695]  ? rmapiAllocWithSecInfoTls+0x65/0x90 [nvidia]
[  397.618766]  ? _rmAllocForDeprecatedApi+0x25/0x30 [nvidia]
[  397.618837]  ? _rmVidHeapControlAllocCommon+0x62/0x80 [nvidia]
[  397.618956]  ? _nvos32FunctionAllocSize+0xd0/0x120 [nvidia]
[  397.619063]  ? RmDeprecatedVidHeapControl+0x73/0x80 [nvidia]
[  397.619164]  ? Nv04VidHeapControlWithSecInfo+0x35/0x40 [nvidia]
[  397.619237]  ? rmapiControlWithSecInfoTls+0xf0/0xf0 [nvidia]
[  397.619307]  ? _rmAllocForDeprecatedApi+0x30/0x30 [nvidia]
[  397.619376]  ? _rmControlForDeprecatedApi+0x30/0x30 [nvidia]
[  397.619454]  ? _rmFreeForDeprecatedApi+0x20/0x20 [nvidia]
[  397.619521]  ? RmCopyUserForDeprecatedApi+0xe0/0xe0 [nvidia]
[  397.619587]  ? _rmMapMemoryForDeprecatedApi+0x30/0x30 [nvidia]
[  397.619652]  ? _rmAllocMemForDeprecatedApi+0x10/0x10 [nvidia]
[  397.619718]  ? RmIoctl+0x88c/0xdf0 [nvidia]
[  397.619824]  ? xas_load+0x5/0xa0
[  397.619825]  ? os_get_current_tick+0x23/0x90 [nvidia]
[  397.619886]  ? os_acquire_spinlock+0x9/0x20 [nvidia]
[  397.619946]  ? portSyncSpinlockAcquire+0x1d/0x50 [nvidia]
[  397.620004]  ? rm_ioctl+0x49/0x70 [nvidia]
[  397.620110]  ? nvidia_ioctl+0xff/0x830 [nvidia]
[  397.620169]  ? nvidia_ioctl+0x605/0x830 [nvidia]
[  397.620228]  ? nvidia_frontend_unlocked_ioctl+0x2f/0x40 [nvidia]
[  397.620290]  ? __x64_sys_ioctl+0x412/0x960
[  397.620295]  ? handle_mm_fault+0xe1/0x2d0
[  397.620297]  ? do_syscall_64+0x40/0x90
[  397.620299]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  397.620301]  </TASK>
[  397.620333] NVRM: VM: memory type 0  cache_type 2 WC support is unavailable!
[  397.620335] NVRM osAllocPagesInternal: osAllocPagesInternal: osAllocPagesInternal unencrypted 1 status 86
[  397.620336] NVRM _memdescAllocInternal: _memdescAllocInternal ADDR_SYSMEM failed .

RodgerZhu commented 9 months ago

@Tan-YiFan Hi, could you help check the error below? Any suggestions?

[ 211.069157] ACPI Warning: _SB.PCI0.S18.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20221020/nsarguments-61)
[ 218.586180] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d08eb4e19cb00 >= 3d08ea7c4adb00
[ 218.586185] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[ 218.586191] NVRM kgspBootstrapRiscvOSEarly_GH100: Timeout waiting for lockdown release. It's also possible that bootrom may have failed. RM may not have access to the BR status to be able to say for sure what failed.
[ 218.586194] NVRM kfspDumpDebugState_GH100: FSP microcode v4.76
[ 218.586196] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x0
[ 218.586198] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x0
[ 218.586200] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x0
[ 218.586202] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0x0
[ 218.586204] NVRM kgspBootstrapRiscvOSEarly_GH100: NV_PGSP_FALCON_MAILBOX0 = 0x0
[ 218.586205] NVRM kgspBootstrapRiscvOSEarly_GH100: NV_PGSP_FALCON_MAILBOX1 = 0x0
[ 218.586208] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0x65
[ 218.586215] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 222.039792] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1660)
[ 222.042118] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 222.073868] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[ 222.073880] NVRM osInitNvMapping: Cannot attach gpu
[ 222.073882] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 222.073889] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[ 222.076151] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 229.994740] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[ 229.994754] NVRM osInitNvMapping: Cannot attach gpu
[ 229.994756] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 229.994764] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[ 229.997304] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 230.027675] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[ 230.027686] NVRM osInitNvMapping: *** Cannot attach gpu
[ 230.027688] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 230.027695] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[ 230.030413] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 230.308994] nvidia-uvm: Loaded the UVM driver, major device number 235.

Tan-YiFan commented 9 months ago

@arronwy The initialization code of the PAT MSR in the guest is at https://github.com/intel/tdx/blob/guest-kexec/arch/x86/mm/pat/memtype.c#L264-L308. 7040600070406 equals PAT(WB, WT, UC_MINUS, UC, WB, WT, UC_MINUS, UC) at line 293. Please check whether rdmsrl(MSR_IA32_CR_PAT, pat_msr_val); at line 270 gives 0.

Tan-YiFan commented 9 months ago

@RodgerZhu It seems that the GPU is H100. Is the guest TDX-enabled? Is the CC mode of the H100 on, off or devtools?

RodgerZhu commented 9 months ago

> @RodgerZhu It seems that the GPU is H100. Is the guest TDX-enabled? Is the CC mode of the H100 on, off or devtools?

Actually, the CC mode is on and the guest is TDX-enabled. It seems like the firmware in the card can't work well with the GSP installed via the driver.

Tan-YiFan commented 9 months ago

@RodgerZhu Could you try non-TDX with non-CC H100 and TDX with H100 with devtools? (Run nvidia-smi and a cuda program in the guest)

RodgerZhu commented 9 months ago

> @RodgerZhu Could you try non-TDX with non-CC H100 and TDX with H100 with devtools? (Run nvidia-smi and a cuda program in the guest)

Non-TDX and non-CC GPU can work normally. I can try devtools mode.

RodgerZhu commented 9 months ago

> @RodgerZhu Could you try non-TDX with non-CC H100 and TDX with H100 with devtools? (Run nvidia-smi and a cuda program in the guest)

[  66.984596] NVRM kgspBootstrapRiscvOSEarly_GH100: Timeout waiting for lockdown release. It's also possible that bootrom may have failed. RM may not have access to the BR status to be able to say for sure what failed.
[  66.984599] NVRM kfspDumpDebugState_GH100: FSP microcode v4.76
[  66.984600] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x0
[  66.984602] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x0
[  66.984603] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x0
[  66.984604] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0x0
[  66.984606] NVRM kgspBootstrapRiscvOSEarly_GH100: NV_PGSP_FALCON_MAILBOX0 = 0x0
[  66.984607] NVRM kgspBootstrapRiscvOSEarly_GH100: NV_PGSP_FALCON_MAILBOX1 = 0x0
[  66.984608] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0x65
[  66.984612] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[  70.397756] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0x65:1660)
[  70.400163] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  76.920700] NVRM kfspProcessCommandResponse_GH100: FSP response reported error. Task ID: 0x2 Command type: 0x14 Error code: 0x6
[  76.920703] NVRM kfspSendBootCommands_GH100: Sent following content to FSP:
[  76.920705] NVRM kfspSendBootCommands_GH100: version=0x1, size=0x35c, gspFmcSysmemOffset=0x131e80000
[  76.920706] NVRM kfspSendBootCommands_GH100: frtsSysmemOffset=0x131f00000, frtsSysmemSize=0x100000
[  76.920706] NVRM kfspSendBootCommands_GH100: frtsVidmemOffset=0x200000, frtsVidmemSize=0x100000
[  76.920707] NVRM kfspSendBootCommands_GH100: gspBootArgsSysmemOffset=0x1306a6000
[  76.920708] NVRM kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
[  76.920710] NVRM kfspDumpDebugState_GH100: FSP microcode v4.76
[  76.920712] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x0
[  76.920713] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x0
[  76.920715] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x0
[  76.920716] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0x0
[  76.920718] NVRM memdescDestroy: Destroying unfreed memory FF3004F4131C2220
[  76.920719] NVRM memdescDestroy: Please call memdescFree()
[  76.925140] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kfspSendBootCommands_HAL(pGpu, pKernelFsp) @ kernel_gsp_gh100.c:565
[  76.925145] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[  76.925150] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[  80.331918] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:1660)
[  80.334314] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 142.964696] NVRM kfspProcessCommandResponse_GH100: FSP response reported error. Task ID: 0x2 Command type: 0x14 Error code: 0x6
[ 142.964700] NVRM kfspSendBootCommands_GH100: Sent following content to FSP:
[ 142.964702] NVRM kfspSendBootCommands_GH100: version=0x1, size=0x35c, gspFmcSysmemOffset=0x132c40000
[ 142.964703] NVRM kfspSendBootCommands_GH100: frtsSysmemOffset=0x132d00000, frtsSysmemSize=0x100000
[ 142.964703] NVRM kfspSendBootCommands_GH100: frtsVidmemOffset=0x200000, frtsVidmemSize=0x100000
[ 142.964704] NVRM kfspSendBootCommands_GH100: gspBootArgsSysmemOffset=0x12f67c000
[ 142.964705] NVRM kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
[ 142.964707] NVRM kfspDumpDebugState_GH100: FSP microcode v4.76
[ 142.964709] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x6
[ 142.964710] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x0
[ 142.964712] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x0
[ 142.964713] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0xe
[ 142.964715] NVRM memdescDestroy: Destroying unfreed memory FF3004F4139E5420
[ 142.964715] NVRM memdescDestroy: Please call memdescFree()
[ 142.969025] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kfspSendBootCommands_HAL(pGpu, pKernelFsp) @ kernel_gsp_gh100.c:565
[ 142.969031] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 142.969035] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 146.397356] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:1660)
[ 146.399852] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 152.496696] NVRM kfspProcessCommandResponse_GH100: FSP response reported error. Task ID: 0x2 Command type: 0x14 Error code: 0x6
[ 152.496700] NVRM kfspSendBootCommands_GH100: Sent following content to FSP:
[ 152.496701] NVRM kfspSendBootCommands_GH100: version=0x1, size=0x35c, gspFmcSysmemOffset=0x131d80000
[ 152.496702] NVRM kfspSendBootCommands_GH100: frtsSysmemOffset=0x132d00000, frtsSysmemSize=0x100000
[ 152.496702] NVRM kfspSendBootCommands_GH100: frtsVidmemOffset=0x200000, frtsVidmemSize=0x100000
[ 152.496703] NVRM kfspSendBootCommands_GH100: gspBootArgsSysmemOffset=0x131c7b000
[ 152.496704] NVRM kfspSendBootCommands_GH100: FSP boot cmds failed. RM cannot boot.
[ 152.496706] NVRM kfspDumpDebugState_GH100: FSP microcode v4.76
[ 152.496708] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(0) = 0x6
[ 152.496709] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(1) = 0x0
[ 152.496711] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(2) = 0x0
[ 152.496712] NVRM kfspDumpDebugState_GH100: NV_PFSP_FALCON_COMMON_SCRATCH_GROUP_2(3) = 0xe
[ 152.496713] NVRM memdescDestroy: Destroying unfreed memory FF3004F414EA6020
[ 152.496714] NVRM memdescDestroy: Please call memdescFree()
[ 152.501125] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kfspSendBootCommands_HAL(pGpu, pKernelFsp) @ kernel_gsp_gh100.c:565
[ 152.501130] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 152.501134] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 155.909118] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x62:0xffff:1660)
[ 155.911790] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 156.127383] nvidia-uvm: Loaded the UVM driver, major device number 235.

Tan-YiFan commented 9 months ago

@RodgerZhu The log Timeout waiting for lockdown release is unexpected. To check if it's a TDX issue or H100 issue, could you try non-TDX guest VM with H100 CC (on)?

jrjatin commented 9 months ago

@herozyg Encountered a SWIOTLB error when trying to load NVIDIA drivers on my CVM. Managed to resolve it by following these steps:

  1. Downloaded the NVIDIA drivers from https://www.nvidia.com/download/driverResults.aspx/216530/en-us/
  2. Installed the drivers using sudo sh nvidia_driver.run -m=kernel-open
  3. Enabled LKCA as per the NVIDIA docs: https://docs.nvidia.com/confidential-computing-deployment-guide.pdf
  4. Rebooted the VM and ran sudo nvidia-persistenced

Then I was able to load the NVIDIA driver on the CVM.

Tan-YiFan commented 9 months ago

@jrjatin It seems that the NVIDIA doc targets the H100, while @herozyg was attaching an A10 to the CVM. Would these steps work on an A10? The A10 does not have confidential computing support.