edesalve opened this issue 5 days ago
What's the output of `nvidia-smi` under your env?
@nv-guomingz I pasted the output of `nvidia-smi -q` in the first message; below is the output of `nvidia-smi`:
Fri Jun 28 10:31:44 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:03:00.0 Off | 0 |
| N/A 30C P0 42W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Thanks @edesalve. We'll try to reproduce your issue internally, but it looks more like an issue related to the nvml package.
@yuxianq could we WAR (work around) this issue?
@edesalve Some NVML APIs are unavailable in a vGPU environment for security reasons. We have already worked around this by falling back to a default cluster key when any NVML API fails. The bugfix will be included in the next weekly release of the main branch.
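For illustration only, a minimal sketch of the fallback pattern described above (this is not the actual TensorRT-LLM patch; the function name and `DEFAULT_CLUSTER_KEY` are assumptions made for the example):

```python
# Sketch: if any NVML query raises, fall back to a default cluster key
# instead of propagating the error (as on vGPU, where some queries are blocked).
import pynvml

DEFAULT_CLUSTER_KEY = "A100-PCIe-80GB"  # hypothetical default, for illustration only


def infer_cluster_key_with_fallback(device_index: int = 0) -> str:
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        name = pynvml.nvmlDeviceGetName(handle)
        # Newer pynvml versions return str, older ones return bytes.
        if isinstance(name, bytes):
            name = name.decode()
        return name
    except pynvml.NVMLError:
        # Some NVML APIs are unavailable on vGPU; use the default key instead.
        return DEFAULT_CLUSTER_KEY
    finally:
        try:
            pynvml.nvmlShutdown()
        except pynvml.NVMLError:
            pass
```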
Thank you for the very quick response! I will wait for next week's version.
System Info
Who can help?
@byshiue
Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
After proper checkpoint creation:
Expected behavior
Successful build of the engine.
actual behavior
additional notes
The same procedure has already been completed successfully on a VM with GPU Passthrough. The configuration of the VM with vGPU technology has been done correctly (details of the installed drivers are at the bottom). The problem arises during the execution of `infer_cluster_info`. I created a small test in Python to check all the NVML requests made in the function, and this is the output:

This happens despite setting `pciPassthru0.cfg.enable_profiling` for the VM, as suggested in the NVIDIA AI Enterprise User Guide. Is there something I'm missing, or are vGPUs simply not supported?
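The original test script and its output are not reproduced above; a probe along the following lines (a sketch only, assuming the `pynvml` package and a representative set of NVML queries rather than the exact calls made by `infer_cluster_info`) can show which queries fail on the vGPU:

```python
# Hypothetical probe: exercise NVML queries commonly used for device/cluster
# detection and report which ones fail on this (vGPU) device.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

checks = {
    "nvmlDeviceGetName": lambda: pynvml.nvmlDeviceGetName(handle),
    "nvmlDeviceGetMemoryInfo": lambda: pynvml.nvmlDeviceGetMemoryInfo(handle).total,
    "nvmlDeviceGetClockInfo(SM)": lambda: pynvml.nvmlDeviceGetClockInfo(
        handle, pynvml.NVML_CLOCK_SM),
    "nvmlDeviceGetMaxClockInfo(SM)": lambda: pynvml.nvmlDeviceGetMaxClockInfo(
        handle, pynvml.NVML_CLOCK_SM),
    "nvmlDeviceGetPowerManagementLimit": lambda: pynvml.nvmlDeviceGetPowerManagementLimit(handle),
}

for name, call in checks.items():
    try:
        print(f"{name}: OK -> {call()}")
    except pynvml.NVMLError as err:
        print(f"{name}: FAILED -> {err}")

pynvml.nvmlShutdown()
```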