NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

No free blocks left when building with a larger max_batch_size #694

Open sleepwalker2017 opened 9 months ago

sleepwalker2017 commented 9 months ago

I'm testing GPTQ on 2×A30 GPUs and found something strange.

When I build the model with max_batch_size = 64, it runs fine with batch = 64, input = 32, output = 96:

[BENCHMARK] batch_size 48 input_length 32 output_length 96 latency(ms) 2455.45 tokensPerSec 1876.65
Benchmarking done. Iteration: 25, duration: 61.39 sec.
Benchmarking done. Iteration: 21, duration: 60.05 sec.
[BENCHMARK] batch_size 64 input_length 32 output_length 96 latency(ms) 2859.64 tokensPerSec 2148.52

But when I build the model with max_batch_size = 128, it fails to run batch = 64, input = 32, output = 96. Here is the error message. Why does that happen?

[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: Can't allocate new blocks. No free blocks left. (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:362)
1       0x557282845de2 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f5884df817a tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::allocateBlock(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, bool) + 954
3       0x7f5884df90d3 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(int, int, int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 435
4       0x7f5884da3d81 tensorrt_llm::runtime::GptSession::kvCacheAddSequences(int, int, int) + 257
5       0x7f5884daab22 tensorrt_llm::runtime::GptSession::generateBatched(std::vector<tensorrt_llm::runtime::GenerationOutput, std::allocator<tensorrt_llm::runtime::GenerationOutput> >&, std::vector<tensorrt_llm::runtime::GenerationInput, std::allocator<tensorrt_llm::runtime::GenerationInput> > const&, tensorrt_llm::runtime::SamplingConfig const&, std::function<void (int, bool)> const&) + 962
6       0x7f5884dac561 tensorrt_llm::runtime::GptSession::generate(tensorrt_llm::runtime::GenerationOutput&, tensorrt_llm::runtime::GenerationInput const&, tensorrt_llm::runtime::SamplingConfig const&) + 3105
7       0x55728284b0de benchmarks/gptSessionBenchmark(+0x1a0de) [0x55728284b0de]
8       0x7f587354fd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f587354fd90]
9       0x7f587354fe40 __libc_start_main + 128
10      0x55728284cf75 benchmarks/gptSessionBenchmark(+0x1bf75) [0x55728284cf75]
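
For reference, my own back-of-the-envelope estimate (assuming LLaMA/Vicuna-13B geometry: 40 layers, 40 KV heads, head size 128, fp16 KV cache, split across the 2 GPUs) says the KV cache for this run should only need about 3 GiB per GPU, which is why the error surprises me:

# Rough per-GPU KV cache estimate for batch 64, input 32, output 96.
# Model geometry below is an assumption (LLaMA/Vicuna-13B): 40 layers,
# 40 KV heads, head size 128, fp16 cache, tensor parallelism = 2.
num_layers, num_kv_heads, head_size = 40, 40, 128
tp_size, kv_bytes = 2, 2                       # fp16 -> 2 bytes per element
batch_size, seq_len = 64, 32 + 96              # input + output tokens

# K and V for one token on one GPU
bytes_per_token = 2 * num_layers * (num_kv_heads // tp_size) * head_size * kv_bytes
print(batch_size * seq_len * bytes_per_token / 1024**3)   # ~3.1 GiB per GPU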

Here is my build command:

python build.py --model_dir /data/vicuna-13b/vicuna-13b-v1.5/ \
                --quant_ckpt_path gptq_tensor/llama-13b-4bit-gs128.safetensors \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4_gptq \
                --per_group \
                --output_dir ./tmp/llama/13B/trt_engines/int4_GPTQ/2-gpu/ \
                --paged_kv_cache \
                --world_size 2 \
                --tp_size 2 \
                --enable_context_fmha \
                --parallel_build \
                --max_batch_size 128
kaiyux commented 9 months ago

When you use a larger batch size, TensorRT requires more activation workspace for the engine, which leaves less free GPU memory for the paged KV cache at runtime.
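
In other words (a simplified sketch with purely illustrative numbers, not the actual TensorRT-LLM allocation code): the paged KV cache manager carves its block pool out of whatever GPU memory remains after the weights and the engine's activation workspace, so an engine built with max_batch_size = 128 reserves more workspace and ends up with a smaller pool. Once the pool can no longer cover batch_size × seq_len tokens, KVCacheManager::addSequence fails with the "No free blocks left" assertion.

# Simplified sketch of the relationship, NOT the real TensorRT-LLM logic.
# All GiB figures and the 128 tokens/block value are illustrative assumptions.
def kv_pool_blocks(total_gib, weights_gib, activation_workspace_gib,
                   block_bytes, kv_mem_fraction=0.9):
    """Blocks the KV cache manager could carve out of the leftover memory."""
    free_bytes = (total_gib - weights_gib - activation_workspace_gib) * 1024**3
    return int(free_bytes * kv_mem_fraction // block_bytes)

tokens_per_block = 128                          # assumed; check your build's setting
block_bytes = tokens_per_block * 409_600        # bytes/token from the estimate above

needed   = 64 * -(-(32 + 96) // tokens_per_block)       # 64 sequences -> 64 blocks
pool_64  = kv_pool_blocks(24, 4.0, 2.0, block_bytes)    # engine built with max_batch_size=64
pool_128 = kv_pool_blocks(24, 4.0, 8.0, block_bytes)    # max_batch_size=128: bigger workspace, fewer blocks
print(needed, pool_64, pool_128)
# If the workspace grows enough that the pool drops below `needed`,
# the runtime hits "Can't allocate new blocks. No free blocks left."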