NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
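For orientation, the workflow exercised in this issue is: quantize a Hugging Face checkpoint into TensorRT-LLM checkpoint format, build a TensorRT engine from it with trtllm-build, then run or benchmark the engine. A minimal sketch of that flow using the same scripts that appear below (all paths are placeholders):

# 1. quantize an HF checkpoint into a TensorRT-LLM checkpoint
python examples/quantization/quantize.py --model_dir ./llama3_8B \
    --qformat w4a8_awq --output_dir ./ckpt_w4a8_awq
# 2. build a TensorRT engine from the checkpoint
trtllm-build --checkpoint_dir ./ckpt_w4a8_awq --output_dir ./engine
# 3. run inference against the engine
python3 examples/run.py --engine_dir ./engine --tokenizer_dir ./llama3_8B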
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

gptSessionBenchmark Failed Because of " Assertion failed: key_size <= remaining_buffer_size " #1808

Closed: Hongbosherlock closed this issue 1 week ago

Hongbosherlock commented 1 week ago

System Info

ubuntu 20.04
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.11.0.dev2024061100

GPU: NVIDIA L40S

Who can help?

@byshiue @hijkzzz

Reproduction

1. quantize

python examples/quantization/quantize.py --model_dir /target/model/llama3_8B \
                                         --dtype float16 \
                                         --qformat w4a8_awq \
                                         --awq_block_size 128 \
                                         --output_dir /target/model/quantized_w4a8-awq \
                                         --calib_size 32

2. trtllm-build

trtllm-build --checkpoint_dir /target/model/quantized_w4a8-awq \
             --output_dir /target/model/trt_engines/w4a8_AWQ/1-gpu/ \
             --gemm_plugin float16 \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --remove_input_padding enable \
             --paged_kv_cache enable \
             --max_batch_size 50 \
             --max_input_len 3000 \
             --max_output_len 3000

3. benchmark

./benchmarks/gptSessionBenchmark \
    --engine_dir "/target/model/trt_engines/w4a8_AWQ/1-gpu/" \
    --batch_size "1" \
    --input_output_len "60,20"

Expected behavior

Get the benchmark result.

actual behavior

Got the following error:

[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: key_size <= remaining_buffer_size (/target/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/cubinObjRegistry.h:49)
1       0x5584e5f4b373 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f0e64a684b7 tensorrt_llm::kernels::jit::CubinObjRegistryTemplate<tensorrt_llm::kernels::XQAKernelFullHashKey, tensorrt_llm::kernels::XQAKernelFullHasher>::CubinObjRegistryTemplate(void const*, unsigned long) + 1639
3       0x7f0e64a674f2 tensorrt_llm::kernels::DecoderXQARunner::Resource::Resource(void const*, unsigned long) + 50
4       0x7f0e9ad778f9 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 873
5       0x7f0e9ad960d3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 19
6       0x7f0e9ad96152 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
7       0x7f0e1493f102 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1066102) [0x7f0e1493f102]
8       0x7f0e1493a1de /usr/local/tensorrt/lib/libnvinfer.so.10(+0x10611de) [0x7f0e1493a1de]
9       0x7f0e148b5177 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xfdc177) [0x7f0e148b5177]
10      0x7f0e148b33fe /usr/local/tensorrt/lib/libnvinfer.so.10(+0xfda3fe) [0x7f0e148b33fe]
11      0x7f0e148cbf27 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff2f27) [0x7f0e148cbf27]
12      0x7f0e148cee7d /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff5e7d) [0x7f0e148cee7d]
13      0x7f0e148cf3b4 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff63b4) [0x7f0e148cf3b4]
14      0x7f0e148fe64f /usr/local/tensorrt/lib/libnvinfer.so.10(+0x102564f) [0x7f0e148fe64f]
15      0x7f0e148ff3f5 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x10263f5) [0x7f0e148ff3f5]
16      0x7f0e148ff489 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1026489) [0x7f0e148ff489]
17      0x7f0e666a8a68 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 504
18      0x7f0e66653db6 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 1126
19      0x5584e5f4fce0 ./benchmarks/gptSessionBenchmark(+0x1dce0) [0x5584e5f4fce0]
20      0x7f0e5f5edd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f0e5f5edd90]
21      0x7f0e5f5ede40 __libc_start_main + 128
22      0x5584e5f533c5 ./benchmarks/gptSessionBenchmark(+0x213c5) [0x5584e5f533c5]
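
The assertion fires while GPTAttentionPlugin is being deserialized (frames 2 to 6 above), which typically means the serialized engine and the runtime loading it come from different TRT-LLM builds. A quick consistency check, sketched under the assumption that the engine's config.json (written by trtllm-build into the engine directory) records the build version:

# version recorded in the engine at build time (assumes a "version" field)
grep '"version"' /target/model/trt_engines/w4a8_AWQ/1-gpu/config.json
# version of the installed runtime
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"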

additional notes

run

python3 run.py --max_output_len=500 \
               --tokenizer_dir=/target/model/llama3_8B \
               --engine_dir=/target/model/trt_engines/w4a8_AWQ/1-gpu/

result:

Input [Text 0]: "Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " painter in Paris, where he was a pupil of the French artist Paul Delaroche. He was a member of the Société des Artistes Français and exhibited at the Paris Salon. He was also a member of the Société des Pastellistes Français. He was a friend of the French artist Jean-Léon Gérôme......

It seems that the engine and the Docker environment are OK, since run.py produces reasonable output.

hijkzzz commented 1 week ago

Hi, please try pip install tensorrt_llm==0.11.0.dev2024061800 and use the container nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3, or build the TRT-LLM container using make -C docker release_build (recommended).
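
A sketch of the recommended route, with the caveat that everything beyond release_build (which is confirmed above) is an assumption about the docker Makefile's target names:

# build the TRT-LLM release container from source
git clone https://github.com/NVIDIA/TensorRT-LLM.git && cd TensorRT-LLM
git submodule update --init --recursive
make -C docker release_build
# launching via a release_run target is an assumption; check the Makefile
make -C docker release_run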

It works well on my side:

| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:01:00.0 Off |                    0 |
| N/A   26C    P8             32W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

jianh@cd56de89319f:/tensorrtllm$ ./cpp/build/benchmarks/gptSessionBenchmark  --engine_dir ./tmp/llama3-8b-awq-engine/ --batch_size "1" --input_output_len "60,20"
Benchmarking done. Iteration: 10, duration: 1.50 sec.
Latencies: [149.56, 149.60, 149.64, 149.48, 151.35, 149.47, 149.45, 149.42, 149.47, 149.59]
[BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 149.70 tokensPerSec 133.60 generation_time(ms) 139.96 generationTokensPerSec 142.89 gpu_peak_mem(gb) 43.61
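
As a cross-check on these numbers: 20 output tokens over the 149.70 ms end-to-end latency is 20 / 0.14970 ≈ 133.6 tokens/s, and over the 139.96 ms generation phase it is 20 / 0.13996 ≈ 142.9 tokens/s, matching the reported tokensPerSec and generationTokensPerSec.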

Thanks

Hongbosherlock commented 1 week ago

please try pip install tensorrt_llm==0.11.0.dev2024061800 and use the container: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3

How can I reproduce your result? Just running this Docker image and doing pip install tensorrt_llm is not enough for me to successfully build an engine.

Hongbosherlock commented 1 week ago

@hijkzzz It didn't work for me; did I miss anything? Please take a look. Thanks!

1. run the docker

docker run -it --shm-size 200G --gpus all --network=host --cap-add=SYS_ADMIN --name nv_fp8 -v ${PWD}:/target nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3

2. install tensorrt_llm and tensorrt

pip install tensorrt_llm==0.11.0.dev2024061800
sh install_tensorrt.sh

3. check installation

python3 -c "import tensorrt_llm"

I got the following error:

 python3 -c "import tensorrt_llm"
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1535, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 97, in <module>
    from accelerate.hooks import AlignDevicesHook, add_hook_to_module
  File "/usr/local/lib/python3.10/dist-packages/accelerate/__init__.py", line 16, in <module>
    from .accelerator import Accelerator
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 35, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/usr/local/lib/python3.10/dist-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/__init__.py", line 182, in <module>
    from .bnb import has_4bit_bnb_layers, load_and_quantize_model
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/bnb.py", line 29, in <module>
    from ..big_modeling import dispatch_model, init_empty_weights
  File "/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py", line 24, in <module>
    from .hooks import (
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 30, in <module>
    from .utils.other import recursive_getattr
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/other.py", line 36, in <module>
    from .transformer_engine import convert_model
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
    import transformer_engine.pytorch as te
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
    from .module import LayerNormLinear
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/__init__.py", line 6, in <module>
    from .layernorm_linear import LayerNormLinear
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/layernorm_linear.py", line 13, in <module>
    from .. import cpp_extensions as tex
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/cpp_extensions/__init__.py", line 6, in <module>
    from transformer_engine_extensions import *
ImportError: /usr/local/lib/python3.10/dist-packages/transformer_engine_extensions.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I tried to uninstall transformer_engine, following https://github.com/chenfei-wu/TaskMatrix/issues/116, but then got another error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 33, in <module>
    import tensorrt_llm.models as models
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/__init__.py", line 34, in <module>
    from .llama.model import LLaMAForCausalLM, LLaMAModel
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 32, in <module>
    from .convert import (load_hf_llama, load_weights_from_hf_by_shard,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 31, in <module>
    from transformers.models.llama.modeling_llama import LlamaDecoderLayer
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 54, in <module>
    from flash_attn import flash_attn_func, flash_attn_varlen_func
  File "/usr/local/lib/python3.10/dist-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
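
Both tracebacks die inside prebuilt CUDA extensions (transformer_engine_extensions, flash_attn_2_cuda) with undefined torch symbols, which suggests wheels compiled against a different torch ABI than the one pulled in by the tensorrt_llm wheel. One possible cleanup, offered purely as an untested assumption:

# assumption: drop the ABI-mismatched optional wheels, then retry the import
pip uninstall -y transformer-engine flash-attn
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"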

hijkzzz commented 1 week ago

Hi, try make -C docker release_build.