NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

trtllm-build on GRID vGPU - nvml errors #1861

Open edesalve opened 5 days ago

edesalve commented 5 days ago

System Info

Who can help?

@byshiue

Information

Tasks

Reproduction

After proper checkpoint creation:

trtllm-build --checkpoint_dir Meta-Llama-3-8B-Instruct-bf16-ckpt --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --output_dir Meta-Llama-3-8B-Instruct-bf16-engine
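
(For context, the checkpoint above was produced with the Llama converter from the TensorRT-LLM examples; the script location and paths below are illustrative, not necessarily the exact ones used:)

python examples/llama/convert_checkpoint.py \
    --model_dir Meta-Llama-3-8B-Instruct \
    --output_dir Meta-Llama-3-8B-Instruct-bf16-ckpt \
    --dtype bfloat16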

Expected behavior

Successful build of the engine.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024062500
[06/28/2024-06:35:05] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set nccl_plugin to auto.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set lookup_plugin to None.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set lora_plugin to None.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set moe_plugin to auto.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set context_fmha to True.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set remove_input_padding to True.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set reduce_fusion to False.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set multi_block_mode to False.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set enable_xqa to True.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set multiple_profiles to False.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set paged_state to True.
[06/28/2024-06:35:05] [TRT-LLM] [I] Set streamingllm to False.
[06/28/2024-06:35:05] [TRT-LLM] [I] Compute capability: (8, 0)
[06/28/2024-06:35:05] [TRT-LLM] [I] SM count: 98
Traceback (most recent call last):
  File "/home/s2e/.local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/home/s2e/.local/lib/python3.10/site-packages/tensorrt_llm/commands/buil                                                                                                                                                             d.py", line 428, in main
    cluster_config = infer_cluster_config()
  File "/home/s2e/.local/lib/python3.10/site-packages/tensorrt_llm/auto_parallel                                                                                                                                                             /cluster_info.py", line 538, in infer_cluster_config
    cluster_info=infer_cluster_info(),
  File "/home/s2e/.local/lib/python3.10/site-packages/tensorrt_llm/auto_parallel                                                                                                                                                             /cluster_info.py", line 460, in infer_cluster_info
    sm_clock = pynvml.nvmlDeviceGetMaxClockInfo(
  File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 2182, in n                                                                                                                                                             vmlDeviceGetMaxClockInfo
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 833, in _n                                                                                                                                                             vmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported

Additional notes

The same procedure has already been completed successfully on a VM with GPU passthrough. The VM using vGPU technology has been configured correctly (details of the installed drivers are at the bottom). The problem arises during the execution of infer_cluster_info. I wrote a small Python test to check all the NVML calls made by that function:

import pynvml
import torch

# Initialize NVML
pynvml.nvmlInit()

def test_nvml_calls():
    device_index = torch.cuda.current_device()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)

    # List of NVML functions to test
    nvml_functions = [
        ("nvmlDeviceGetCudaComputeCapability", lambda: pynvml.nvmlDeviceGetCudaComputeCapability(handle)),
        ("nvmlDeviceGetMaxClockInfo for SM Clock", lambda: pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)),
        ("nvmlDeviceGetMaxClockInfo for Mem Clock", lambda: pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_MEM)),
        ("nvmlDeviceGetMemoryBusWidth", lambda: pynvml.nvmlDeviceGetMemoryBusWidth(handle)),
        ("nvmlDeviceGetNvLinkState", lambda: pynvml.nvmlDeviceGetNvLinkState(handle, 0)),
        ("nvmlDeviceGetNvLinkVersion", lambda: pynvml.nvmlDeviceGetNvLinkVersion(handle, 0)),
        ("nvmlDeviceGetCurrPcieLinkGeneration", lambda: pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)),
        ("nvmlDeviceGetCurrPcieLinkWidth", lambda: pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)),
    ]

    results = {}
    for func_name, func in nvml_functions:
        try:
            result = func()
            results[func_name] = ("Success", result)
        except pynvml.NVMLError as e:
            results[func_name] = ("Failed", str(e))

    return results

# Run the tests and print the results
results = test_nvml_calls()
for func_name, (status, output) in results.items():
    print(f"{func_name}: {status} - Output/Error: {output}")

# Finalize NVML
pynvml.nvmlShutdown()

and this is the output:

nvmlDeviceGetCudaComputeCapability: Success - Output/Error: (8, 0)
nvmlDeviceGetMaxClockInfo for SM Clock: Failed - Output/Error: Not Supported
nvmlDeviceGetMaxClockInfo for Mem Clock: Failed - Output/Error: Not Supported
nvmlDeviceGetMemoryBusWidth: Success - Output/Error: 5120
nvmlDeviceGetNvLinkState: Failed - Output/Error: Not Supported
nvmlDeviceGetNvLinkVersion: Failed - Output/Error: Not Supported
nvmlDeviceGetCurrPcieLinkGeneration: Failed - Output/Error: Not Supported
nvmlDeviceGetCurrPcieLinkWidth: Failed - Output/Error: Not Supported

This is despite setting pciPassthru0.cfg.enable_profiling for the VM, as suggested in the NVIDIA AI Enterprise User Guide. Is there something I'm missing, or are vGPUs simply not supported?
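
In the meantime, a possible stop-gap might be to guard the failing NVML call before running the build entrypoint. Below is an untested sketch (the fallback clock values are assumptions taken from the Clocks section of the nvidia-smi output below, and other NVML calls inside infer_cluster_info may need the same guard):

# patched_build.py -- untested stop-gap sketch: monkey-patch the NVML call that
# raises "Not Supported" on this vGPU, then run the normal trtllm-build entrypoint.
import sys

import pynvml
from tensorrt_llm.commands.build import main

_orig_get_max_clock = pynvml.nvmlDeviceGetMaxClockInfo

def _patched_get_max_clock(handle, clock_type):
    try:
        return _orig_get_max_clock(handle, clock_type)
    except pynvml.NVMLError:
        # Fallback values (MHz) taken from the nvidia-smi output below; adjust as needed.
        return 1410 if clock_type == pynvml.NVML_CLOCK_SM else 1512

pynvml.nvmlDeviceGetMaxClockInfo = _patched_get_max_clock

if __name__ == "__main__":
    # Usage: python patched_build.py --checkpoint_dir ... --gemm_plugin bfloat16 ...
    sys.exit(main())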

nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Fri Jun 28 07:03:27 2024
Driver Version                            : 550.54.15
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Product Name                          : GRID A100D-7-80C
    Product Brand                         : NVIDIA Virtual Compute Server
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Enabled
        Pending                           : Enabled
    MIG Device
        Index                             : 0
        GPU Instance ID                   : 0
        Compute Instance ID               : 0
        Device Attributes
            Shared
                Multiprocessor count      : 98
                Copy Engine count         : 7
                Encoder count             : 0
                Decoder count             : 5
                OFA count                 : 1
                JPG count                 : 1
        ECC Errors
            Volatile
                SRAM Uncorrectable        : 0
        FB Memory Usage
            Total                         : 76011 MiB
            Reserved                      : 0 MiB
            Used                          : 0 MiB
            Free                          : 76011 MiB
        BAR1 Memory
            Total                         : 4096 MiB
            Used                          : 0 MiB
            Free                          : 4095 MiB
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-be04bb87-20dd-11b2-bae1-30170d8796ab
    Minor Number                          : 0
    VBIOS Version                         : 00.00.00.00.00
    MultiGPU Board                        : No
    Board ID                              : 0x200
    Board Part Number                     : N/A
    GPU Part Number                       : 20B5-893-A1
    FRU Part Number                       : N/A
    Module ID                             : N/A
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : VGPU
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
        License Status                    : Licensed (Expiry: 2024-6-29 3:6:29 GMT)
    GPU Reset Status
        Reset Required                    : N/A
        Drain and Reset Recommended       : N/A
    GSP Firmware Version                  : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x02
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x20B510DE
        Bus Id                            : 00000000:02:00.0
        Sub System Id                     : 0x159510DE
        GPU Link Info
            PCIe Generation
                Max                       : N/A
                Current                   : N/A
                Device Current            : N/A
                Device Max                : N/A
                Host Max                  : N/A
            Link Width
                Max                       : N/A
                Current                   : N/A
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : N/A
        Replay Number Rollovers           : N/A
        Tx Throughput                     : N/A
        Rx Throughput                     : N/A
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons                  : N/A
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 81920 MiB
        Reserved                          : 5908 MiB
        Used                              : 0 MiB
        Free                              : 76011 MiB
    BAR1 Memory Usage
        Total                             : 4096 MiB
        Used                              : 0 MiB
        Free                              : 4096 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : N/A
        Decoder                           : N/A
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : N/A
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : N/A
        GPU Slowdown Temp                 : N/A
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1275 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes                             : None
nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
nv-guomingz commented 5 days ago

what's the output of nvidia-smi under your env?

edesalve commented 5 days ago

@nv-guomingz I pasted the output of nvidia-smi -q in the first message; the output of nvidia-smi follows:

Fri Jun 28 10:31:44 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:03:00.0 Off |                    0 |
| N/A   30C    P0             42W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
nv-guomingz commented 4 days ago

Thanks @edesalve. We'll try to reproduce your issue internally, but it looks more like an issue related to the nvml package.

@yuxianq could we WAR (work around) this issue?

yuxianq commented 4 days ago

@edesalve Some NVML APIs are unavailable in a vGPU environment for security reasons. We have already worked around this by falling back to a default cluster key when any NVML API fails. The bugfix will be provided in the next weekly release of the main branch.
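
(Conceptually, the fallback behaves like the following minimal sketch; this is not the actual TensorRT-LLM code and the default key name is only illustrative:)

import pynvml

def infer_cluster_config_with_fallback(default_cluster_key="A100-PCIe-80GB"):
    # Sketch: derive cluster info from NVML, but fall back to a default cluster
    # key when any query fails (e.g. "Not Supported" on vGPU).
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        sm_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)
        mem_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_MEM)
        return {"cluster_key": None,
                "cluster_info": {"sm_clock": sm_clock, "mem_clock": mem_clock}}
    except pynvml.NVMLError:
        return {"cluster_key": default_cluster_key, "cluster_info": None}
    finally:
        pynvml.nvmlShutdown()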

edesalve commented 4 days ago

Thank you for the very quick response; I will wait for next week's version!