NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

llama2 runs normally only on adjacent gpus #1868

Closed janpetrov closed 1 month ago

janpetrov commented 4 months ago

System Info

tensorrt-llm version 0.11.0.dev2024062500

Architecture: x86_64 (AMD EPYC 9354 32-Core Processor)

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:C5:00.0 Off |                    0 |
| N/A   38C    P0             50W /  350W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  |   00000000:C6:00.0 Off |                    0 |
| N/A   38C    P0             48W /  350W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 PCIe               On  |   00000000:C9:00.0 Off |                    0 |
| N/A   38C    P0             48W /  350W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 PCIe               On  |   00000000:CF:00.0 Off |                    0 |
| N/A   38C    P0             49W /  350W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Output of nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    PXB     PXB     32-63,96-127    1               N/A
GPU1    NV12     X      PXB     PXB     32-63,96-127    1               N/A
GPU2    PXB     PXB      X      NV12    32-63,96-127    1               N/A
GPU3    PXB     PXB     NV12     X      32-63,96-127    1               N/A

Who can help?

@nv-guomingz

Reproduction

We use the Llama 2 70B model, with no quantization.

python "$SOURCE_CODE_DIR"/convert_checkpoint.py \ --model_dir "$MODEL_DIR" \ --output_dir "$TMP_DIR" \ --dtype bfloat16 \ --tp_size 4

trtllm-build \
    --checkpoint_dir "$TMP_DIR" \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_input_len 4096 \
    --max_seq_len 4096 \
    --max_batch_size 4 \
    --workers 4 \
    --output_dir "$OUTPUT_DIR"

CUDA_VISIBLE_DEVICES="0,1,2,3" python3 /app/scripts/launch_triton_server.py \ --world_size=4 \ --model_repo=/cephfs-ng/triton/llama2

Expected behavior

the model loads

actual behavior

CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:117)
1 0x7eeb500750f5 void tensorrt_llm::common::check(cudaError, char const*, char const*, int) + 149
2 0x7eea14e5cd74 tensorrt_llm::runtime::BufferManager::gpu(unsigned long, nvinfer1::DataType) const + 324
3 0x7eea14f3a535 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(tensorrt_llm::runtime::RawEngine const&, nvinfer1::ILogger, float, bool) + 853
4 0x7eea1518bf12 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
5 0x7eea151aefc4 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
6 0x7eea151af858 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional const&, std::optional<std::vector<unsigned char, std::allocator > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
7 0x7eea151b5014 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
8 0x7eea151a9f50 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
9 0x7eeb5006f182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538
10 0x7eeb5006f782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
11 0x7eed673dc8f5 TRITONBACKEND_ModelInstanceInitialize + 101
12 0x7eed6591fb0f /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b0b0f) [0x7eed6591fb0f]
13 0x7eed65920d57 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b1d57) [0x7eed65920d57]
14 0x7eed65903145 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x194145) [0x7eed65903145]
15 0x7eed65903796 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x194796) [0x7eed65903796]
16 0x7eed6591010d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a110d) [0x7eed6591010d]
17 0x7eed64f6fee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7eed64f6fee8]
18 0x7eed658f99fb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18a9fb) [0x7eed658f99fb]
19 0x7eed6590ad2a /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19bd2a) [0x7eed6590ad2a]
20 0x7eed6590f55c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a055c) [0x7eed6590f55c]
21 0x7eed65a04bce /opt/tritonserver/bin/../lib/libtritonserver.so(+0x295bce) [0x7eed65a04bce]
22 0x7eed65a0818c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x29918c) [0x7eed65a0818c]
23 0x7eed65b66122 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f7122) [0x7eed65b66122]
24 0x7eed651db253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7eed651db253]
25 0x7eed64f6aac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7eed64f6aac3]
26 0x7eed64ffc850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7eed64ffc850]

additional notes

janpetrov commented 4 months ago

The same bug occurs in v0.9.0 and v0.10.0 (we installed 0.11.0.dev2024062500 only after seeing it there).

janpetrov commented 4 months ago

The error above occurs for all versions (v0.9.0, v0.10.0 and 0.11.0.dev2024062500) when the driver 550.90.07 is installed.

When the driver 545.23.08 is installed, all versions (v0.9.0, v0.10.0 and 0.11.0.dev2024062500) work fine.
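
For anyone reproducing the driver dependence, recording the exact driver in use inside the container avoids confusion about which side of the comparison a given run falls on. A small sketch (index, name, and driver_version are standard nvidia-smi query fields):

nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader
# e.g. 0, NVIDIA H100 PCIe, 550.90.07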

janpetrov commented 4 months ago

The same error occurs with 0.11.0.dev2024062500 and driver 550.90.07 on a machine with all 8 GPUs accessible (both when the model is compiled with --tp_size 4 and with --tp_size 8).

Please find below the output of nvidia-smi topo -m on that machine (the printout above was taken when the Docker container had been allocated only the first 4 GPUs).

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NODE    NODE    SYS     SYS     SYS     SYS     0-31            0               N/A
GPU1    NV12     X      NODE    NODE    SYS     SYS     SYS     SYS     0-31            0               N/A
GPU2    NODE    NODE     X      NV12    SYS     SYS     SYS     SYS     0-31            0               N/A
GPU3    NODE    NODE    NV12     X      SYS     SYS     SYS     SYS     0-31            0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NV12    NODE    NODE    32-63           1               N/A
GPU5    SYS     SYS     SYS     SYS     NV12     X      NODE    NODE    32-63           1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV12    32-63           1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV12     X      32-63           1               N/A

janpetrov commented 3 months ago

The same holds for the current version 0.11.0:

The error occurs when the newer driver 550.90.07 is installed and does not occur when the older driver 545.23.08 is installed.

yuxianq commented 3 months ago

@janpetrov Do you mean this bug appears when CUDA_VISIBLE_DEVICES="0,1,2,3" or CUDA_VISIBLE_DEVICES="1,2", but it passes when CUDA_VISIBLE_DEVICES="0,1" or CUDA_VISIBLE_DEVICES="2,3"? It seems like a driver issue if all tests can pass with the older driver 545.23.08.

yuxianq commented 3 months ago

@janpetrov Thanks for your detailed explanation, I was misled by the title.

janpetrov commented 3 months ago

@yuxianq Thank you for your insightful question, and please excuse my previous answer, which was misleading and which I have deleted. Second serve :-):

If CUDA_VISIBLE_DEVICES="0,1,2,3" (4 GPUs), or even if all 8 GPUs on the machine are visible, and we run Llama 2 70B fp16 inference across 4 GPUs, then the error occurs with driver 550.90.07 (and does not occur with driver 545.23.08).

If, however, we use two GPUs connected by NVLink (when we run Llama 2 70B fp8), or even a single GPU, then no error occurs (everything works even with driver 550.90.07).
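
As the topology tables above show, only adjacent GPU pairs (0/1, 2/3, ...) share an NV12 link; the other pairs go through PXB, NODE, or SYS. A quick way to see which device pairs report peer access from inside the container is sketched below; it assumes PyTorch is available in the environment (it is a TensorRT-LLM dependency) and uses torch.cuda.can_device_access_peer:

python3 -c '
import torch
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            # expected True for NVLink-connected (adjacent) pairs;
            # PCIe-only pairs depend on the platform and driver
            print(f"GPU{i} -> GPU{j}: peer access = {torch.cuda.can_device_access_peer(i, j)}")
'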

janpetrov commented 1 month ago

The bug has been fixed in version 0.12.

In version 0.11, if some GPUs are not mutually connected by NVLink (i.e. the machine has no SXM/NVSwitch all-to-all fabric), we need to set use_custom_all_reduce to disable.
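
For reference, use_custom_all_reduce is a build-time option in the 0.11 CLI, so the workaround goes into the trtllm-build step from the reproduction above. A sketch (check trtllm-build --help on your exact version for the flag spelling and accepted values):

trtllm-build \
    --checkpoint_dir "$TMP_DIR" \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_input_len 4096 \
    --max_seq_len 4096 \
    --max_batch_size 4 \
    --workers 4 \
    --use_custom_all_reduce disable \
    --output_dir "$OUTPUT_DIR"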