torch.cuda.is_available() ERROR on the guest VM

seungsoo-lee commented 8 months ago

Machine Spec.

CPU: Dual AMD EPYC 9224 16-Core Processor
GPU: H100 10de:2331 (vbios: 96.00.5E.00.01 cuda: 12.2 nvidia driver: 535.86.10)
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel

On the guest VM, CUDA, the NVIDIA driver, and PyTorch (pip3 install torch torchvision torchaudio) are installed.

The nvidia-smi output (on the guest) is as follows:

cclab@guest:~$ nvidia-smi
Mon Jan  8 03:55:57 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:01:00.0 Off |                    0 |
| N/A   32C    P0              47W / 350W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

But when I tried to run torch.cuda.is_available(), it says:

>>> import torch
>>> torch.cuda.is_available()
/shared/nvAttest/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

What's the problem? Do you have any idea?

Tan-YiFan commented 8 months ago

Try: nvidia-smi conf-compute -grs

If it returns "not ready", run nvidia-smi conf-compute -srs 1. After that, torch.cuda.is_available() should no longer fail.

Otherwise, could you provide the output of dmesg?
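
For anyone who wants to script this, the two nvidia-smi commands above could be wrapped roughly as follows. This is only a sketch: the helper names (parse_ready_state, ensure_gpu_ready) are made up for illustration, and it assumes the -grs output contains the text "ready" or "not ready" as quoted in this thread. The exact output wording may differ between driver versions.

```python
import subprocess


def parse_ready_state(output: str) -> bool:
    """Return True if the `-grs` output reports the GPU as ready.

    Assumption: the output contains "ready" or "not ready", as quoted
    in this thread. Adjust if your driver prints something different.
    """
    text = output.lower()
    # "not ready" also contains "ready", so rule it out first.
    if "not ready" in text:
        return False
    return "ready" in text


def ensure_gpu_ready(run=subprocess.run) -> bool:
    """Query the ready state; if the GPU is not ready, set it and re-check.

    `run` is injectable so the logic can be exercised without a GPU.
    """
    query = run(
        ["nvidia-smi", "conf-compute", "-grs"],
        capture_output=True, text=True, check=True,
    )
    if parse_ready_state(query.stdout):
        return True

    # Same command as suggested above: set the ready state to 1.
    run(
        ["nvidia-smi", "conf-compute", "-srs", "1"],
        capture_output=True, text=True, check=True,
    )

    recheck = run(
        ["nvidia-smi", "conf-compute", "-grs"],
        capture_output=True, text=True, check=True,
    )
    return parse_ready_state(recheck.stdout)
```

Calling ensure_gpu_ready() inside the guest runs the real commands; if it still returns False, the dmesg output is the next thing to look at.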

seungsoo-lee commented 8 months ago

@Tan-YiFan

Thanks. After I ran nvidia-smi conf-compute -srs 1, torch.cuda.is_available() returns True.
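
In case it helps others verify the same thing, here is a rough sketch of a check that captures the CUDA initialization warning instead of letting it scroll past. The helper names (cuda_init_report, check_torch) are hypothetical, and it assumes the failure surfaces as a Python warning containing "Error 802", as in the output posted above.

```python
import warnings
from typing import Callable, List, Tuple


def cuda_init_report(is_available: Callable[[], bool]) -> Tuple[bool, List[str]]:
    """Call `is_available` and return its result plus any warning messages."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones that were already shown once.
        warnings.simplefilter("always")
        ok = bool(is_available())
    return ok, [str(w.message) for w in caught]


def check_torch() -> bool:
    """Run the check against PyTorch (imported lazily, only when called)."""
    import torch

    ok, messages = cuda_init_report(torch.cuda.is_available)
    for message in messages:
        if "Error 802" in message:
            print("CUDA not initialized; check the CC GPU ready state.")
    return ok
```

Running check_torch() before and after setting the ready state should show the result change from False to True.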

Btw, is nvidia-smi conf-compute -srs 1 the only way to get "Confidential Compute GPUs Ready state: ready"?

How can I use the attestation SDK to set the Ready state?

Tan-YiFan commented 8 months ago

You can search for set_gpu_ready_state in this repo. This function does the same thing as nvidia-smi conf-compute -srs [0/1].

seungsoo-lee commented 8 months ago

Oh, I thought the Confidential Compute GPUs Ready state became ready after successfully attesting the GPU with the attestation SDK, not by setting it statically.

Then, did you run the k8s workloads successfully in your k8s cluster? #36

Tan-YiFan commented 8 months ago

I have never run k8s workloads on H100 CC.

seungsoo-lee commented 8 months ago

@Tan-YiFan

Then, in your case, how do you run confidential computing workloads in the guest VM?

Could you let me know?

Tan-YiFan commented 8 months ago

Users have root access to the guest VM, so containers are not used.

rnertney commented 3 months ago

If you wish to use containers, please reference this guide for now.

NVIDIA / nvtrust

torch.cuda.is_available() ERROR on the guest VM #39