Closed. janpetrov closed this issue 1 month ago.
The same bug occurs in v0.9.0 and v0.10.0 (only then did we install 0.11.0.dev2024062500).
The error above occurs for all versions (v0.9.0, v0.10.0 and 0.11.0.dev2024062500) when the driver 550.90.07 is installed.
When the driver 545.23.08 is installed, all versions (v0.9.0, v0.10.0 and 0.11.0.dev2024062500) work fine.
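For completeness, the installed driver version can be confirmed with a standard nvidia-smi query (nothing here is specific to TensorRT-LLM; this is simply how each environment's driver can be checked):

# Print the installed NVIDIA driver version (e.g. 550.90.07 or 545.23.08)
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1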
The same error occurs with 0.11.0.dev2024062500 and driver 550.90.07 on the machine with all 8 GPUs accessible (both when the model is built with --tp_size 4 and when it is built with --tp_size 8).
Please find below the output of nvidia-smi topo -m (whereas the printout above was captured when the Docker container had been allocated only the first 4 GPUs).
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NODE NODE SYS SYS SYS SYS 0-31 0 N/A
GPU1 NV12 X NODE NODE SYS SYS SYS SYS 0-31 0 N/A
GPU2 NODE NODE X NV12 SYS SYS SYS SYS 0-31 0 N/A
GPU3 NODE NODE NV12 X SYS SYS SYS SYS 0-31 0 N/A
GPU4 SYS SYS SYS SYS X NV12 NODE NODE 32-63 1 N/A
GPU5 SYS SYS SYS SYS NV12 X NODE NODE 32-63 1 N/A
GPU6 SYS SYS SYS SYS NODE NODE X NV12 32-63 1 N/A
GPU7 SYS SYS SYS SYS NODE NODE NV12 X 32-63 1 N/A
The same holds for the current version 0.11.0:
The error occurs when the newer driver 550.90.07 is installed and does not occur when the older driver 545.23.08 is installed.
@janpetrov Do you mean this bug appears when CUDA_VISIBLE_DEVICES="0,1,2,3" or CUDA_VISIBLE_DEVICES="1,2", but it passes when CUDA_VISIBLE_DEVICES="0,1" or CUDA_VISIBLE_DEVICES="2,3"? It seems like a driver issue if all tests can pass with the older driver 545.23.08.
@janpetrov Thanks for your detailed explanation, I was misled by the title.
@yuxianq Thank you for your insightful question, and please excuse my previous answer, which was misleading and which I have deleted. Second serve :-):
If CUDA_VISIBLE_DEVICES="0,1,2,3" (4 GPUs), or even if all 8 GPUs on the machine are visible, and we run Llama 2 70B fp16 inference on 4 GPUs, then the error occurs with driver 550.90.07 (and does not occur with driver 545.23.08).
If, however, we use two GPUs connected by NVLink (when we run Llama 2 70B fp8 inference) or even a single GPU, then no error occurs (everything works even with driver 550.90.07).
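To illustrate the working two-GPU case, a minimal sketch follows. Only launch_triton_server.py and its flags come from the reproduction section below; the tp_size 2 engine path is hypothetical and assumes such an engine has been built separately:

# Hypothetical launch on the NVLink-connected pair GPU0/GPU1 (NV12 in the topology above);
# assumes an engine built with --tp_size 2 exists at the made-up path below
CUDA_VISIBLE_DEVICES="0,1" python3 /app/scripts/launch_triton_server.py \
    --world_size=2 \
    --model_repo=/cephfs-ng/triton/llama2-tp2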
The bug has been fixed in version 0.12. In version 0.11, if some GPUs are not mutually connected by NVLink or SXM, we need to set use_custom_all_reduce to disable.
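As a concrete sketch of that 0.11 workaround (assuming trtllm-build in 0.11 still accepts the --use_custom_all_reduce enable/disable option), the build command from the reproduction section below would become:

# Same build as in the reproduction section, with the custom all-reduce plugin disabled
# (workaround for v0.11 when some GPUs are not mutually NVLink-connected)
trtllm-build \
    --checkpoint_dir "$TMP_DIR" \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_input_len 4096 \
    --max_seq_len 4096 \
    --max_batch_size 4 \
    --workers 4 \
    --use_custom_all_reduce disable \
    --output_dir "$OUTPUT_DIR"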
System Info
tensorrt-llm version: 0.11.0.dev2024062500
Architecture: x86_64 (CPU: AMD EPYC 9354 32-Core Processor)
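For reference, a rough sketch of how this information can be collected inside the container (assuming pip and lscpu are available there):

# Package version and host CPU, as reported above
pip show tensorrt_llm | grep -i version
lscpu | grep -E "Architecture|Model name"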
Who can help?
@nv-guomingz
Information
Tasks
Reproduction
We use the Llama 2 70B model with no quantization.
python "$SOURCE_CODE_DIR"/convert_checkpoint.py \ --model_dir "$MODEL_DIR" \ --output_dir "$TMP_DIR" \ --dtype bfloat16 \ --tp_size 4
trtllm-build \ --checkpoint_dir "$TMP_DIR" \ --gpt_attention_plugin bfloat16 \ --gemm_plugin bfloat16 \ --max_input_len 4096 \ --max_seq_len 4096 \ --max_batch_size 4 \ --workers 4 \ --output_dir "$OUTPUT_DIR"
CUDA_VISIBLE_DEVICES="0,1,2,3" python3 /app/scripts/launch_triton_server.py \ --world_size=4 \ --model_repo=/cephfs-ng/triton/llama2
Expected behavior
The model loads.
Actual behavior
CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:117)
1  0x7eeb500750f5 void tensorrt_llm::common::check(cudaError, char const, char const, int) + 149
2  0x7eea14e5cd74 tensorrt_llm::runtime::BufferManager::gpu(unsigned long, nvinfer1::DataType) const + 324
3  0x7eea14f3a535 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(tensorrt_llm::runtime::RawEngine const&, nvinfer1::ILogger, float, bool) + 853
4  0x7eea1518bf12 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
5  0x7eea151aefc4 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
6  0x7eea151af858 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional const&, std::optional<std::vector<unsigned char, std::allocator > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
7  0x7eea151b5014 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
8  0x7eea151a9f50 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
9  0x7eeb5006f182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState, TRITONBACKEND_ModelInstance) + 1538
10 0x7eeb5006f782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
11 0x7eed673dc8f5 TRITONBACKEND_ModelInstanceInitialize + 101
12 0x7eed6591fb0f /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b0b0f) [0x7eed6591fb0f]
13 0x7eed65920d57 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b1d57) [0x7eed65920d57]
14 0x7eed65903145 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x194145) [0x7eed65903145]
15 0x7eed65903796 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x194796) [0x7eed65903796]
16 0x7eed6591010d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a110d) [0x7eed6591010d]
17 0x7eed64f6fee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7eed64f6fee8]
18 0x7eed658f99fb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18a9fb) [0x7eed658f99fb]
19 0x7eed6590ad2a /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19bd2a) [0x7eed6590ad2a]
20 0x7eed6590f55c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a055c) [0x7eed6590f55c]
21 0x7eed65a04bce /opt/tritonserver/bin/../lib/libtritonserver.so(+0x295bce) [0x7eed65a04bce]
22 0x7eed65a0818c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x29918c) [0x7eed65a0818c]
23 0x7eed65b66122 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f7122) [0x7eed65b66122]
24 0x7eed651db253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7eed651db253]
25 0x7eed64f6aac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7eed64f6aac3]
26 0x7eed64ffc850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7eed64ffc850]
Additional notes