NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Assertion failed: getIdx() should not be used with entry 10 #1661

Open Hukongtao opened 5 months ago

Hukongtao commented 5 months ago

System Info

CPU: x86-64
GPU: A100
TensorRT-LLM: 0.11.0.dev2024051400

Who can help?

@ncomly-nvidia @byshiue

Reproduction

Problem background:
I want to use TRT-LLM to optimize a Qwen-32B GPTQ 4-bit model, and I only need a single output token. To save memory, I want to set use_cache=False, like this:
https://github.com/NVIDIA/TensorRT-LLM/blob/5d8ca2faf74c494f220c8f71130340b513eea9a9/tensorrt_llm/models/modeling_utils.py#L601
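
The screenshot referenced above is not reproduced here, so the following is only a hypothetical sketch of the kind of change being described: forcing use_cache=False when the build-time inputs are prepared, so the traced network has no KV-cache tensors. It assumes DecoderModelForCausalLM.prepare_inputs exists and accepts a use_cache keyword argument; the exact class and signature in modeling_utils.py may differ between versions.

# Hypothetical sketch; not the actual patch from the screenshot above.
# Assumes the build flow passes use_cache to prepare_inputs as a keyword.
from tensorrt_llm.models.modeling_utils import DecoderModelForCausalLM

_orig_prepare_inputs = DecoderModelForCausalLM.prepare_inputs

def _prepare_inputs_no_cache(self, *args, **kwargs):
    # Override whatever the build flow passed so no KV-cache inputs are created.
    kwargs["use_cache"] = False
    return _orig_prepare_inputs(self, *args, **kwargs)

DecoderModelForCausalLM.prepare_inputs = _prepare_inputs_no_cache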

Then I run:

set -ex

python3 convert_checkpoint.py \
    --model_dir         ./Qwen1.5-32B-Chat-GPTQ-Int4/ \
    --output_dir        ./tllm_checkpoint_1gpu_gptq/ \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --load_model_on_cpu \
    --qwen_type qwen2

python3 build.py \
    --checkpoint_dir    ./tllm_checkpoint_1gpu_gptq/ \
    --output_dir        ./trt_engines/int4_GPTQ/1-gpu/ \
    --gemm_plugin float16 \
    --max_input_len 4096 \
    --max_output_len 2 \
    --max_batch_size 1 \
    --gather_all_token_logits

But I got:

[05/24/2024-13:21:11] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[05/24/2024-13:21:11] [TRT] [W] [RemoveDeadLayers] Input Tensor kv_cache_block_offsets is unused or used only at compile-time, but is not being removed.
[05/24/2024-13:21:11] [TRT] [W] [RemoveDeadLayers] Input Tensor host_kv_cache_block_offsets is unused or used only at compile-time, but is not being removed.
[05/24/2024-13:21:11] [TRT] [W] [RemoveDeadLayers] Input Tensor host_kv_cache_pool_pointers is unused or used only at compile-time, but is not being removed.
[05/24/2024-13:21:11] [TRT] [W] [RemoveDeadLayers] Input Tensor sequence_length is unused or used only at compile-time, but is not being removed.
[05/24/2024-13:21:11] [TRT] [W] [RemoveDeadLayers] Input Tensor host_past_key_value_lengths is unused or used only at compile-time, but is not being removed.
[05/24/2024-13:21:11] [TRT] [W] [RemoveDeadLayers] Input Tensor cache_indirection is unused or used only at compile-time, but is not being removed.
[05/24/2024-13:21:11] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: getIdx() should not be used with entry 10
 (/usr/local/TensorRT-LLM/cpp/tensorrt_llm/plugins/gptAttentionPlugin/gptAttentionPlugin.cpp:121)
1       0x7f3ea222e08f /usr/local/TensorRT-LLM/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x5608f) [0x7f3ea222e08f]
2       0x7f3ea22a5dcb tensorrt_llm::plugins::GPTAttentionPlugin::getIdx(tensorrt_llm::plugins::GPTAttentionPlugin::IdxEntry const&) const + 107
3       0x7f3ea22a60ac tensorrt_llm::plugins::GPTAttentionPlugin::supportsFormatCombination(int, nvinfer1::PluginTensorDesc const*, int, int) + 700
4       0x7f3f54560ba4 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xafcba4) [0x7f3f54560ba4]
5       0x7f3f547eb013 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xd87013) [0x7f3f547eb013]
6       0x7f3f54577b8d /usr/local/tensorrt/lib/libnvinfer.so.10(+0xb13b8d) [0x7f3f54577b8d]
7       0x7f3f54850b0e /usr/local/tensorrt/lib/libnvinfer.so.10(+0xdecb0e) [0x7f3f54850b0e]
8       0x7f3f546fa03f /usr/local/tensorrt/lib/libnvinfer.so.10(+0xc9603f) [0x7f3f546fa03f]
9       0x7f3f54701edc /usr/local/tensorrt/lib/libnvinfer.so.10(+0xc9dedc) [0x7f3f54701edc]
10      0x7f3f5483dad9 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xdd9ad9) [0x7f3f5483dad9]
11      0x7f3f5484507c /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde107c) [0x7f3f5484507c]
12      0x7f3f54847071 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde3071) [0x7f3f54847071]
13      0x7f3f5448c61c /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2861c) [0x7f3f5448c61c]
14      0x7f3f54491837 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2d837) [0x7f3f54491837]
15      0x7f3f544921af /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2e1af) [0x7f3f544921af]
16      0x7f3f623d6558 /usr/local/lib/python3.9/dist-packages/tensorrt/tensorrt.so(+0xa6558) [0x7f3f623d6558]
17      0x7f3f62375833 /usr/local/lib/python3.9/dist-packages/tensorrt/tensorrt.so(+0x45833) [0x7f3f62375833]
18            0x53f350 python3() [0x53f350]
19            0x51d89b _PyObject_MakeTpCall + 923
20            0x53bf25 python3() [0x53bf25]
21            0x51af55 python3() [0x51af55]
22            0x51975d _PyEval_EvalFrameDefault + 31949
23            0x528b63 _PyFunction_Vectorcall + 419
24            0x513e8b _PyEval_EvalFrameDefault + 9211
25            0x5106ed python3() [0x5106ed]
26            0x528d21 _PyFunction_Vectorcall + 865
27            0x51af55 python3() [0x51af55]
28            0x519913 _PyEval_EvalFrameDefault + 32387
29            0x528b63 _PyFunction_Vectorcall + 419
30            0x51af55 python3() [0x51af55]
31            0x519fc7 _PyEval_EvalFrameDefault + 34103
32            0x5106ed python3() [0x5106ed]
33            0x528d21 _PyFunction_Vectorcall + 865
34            0x53c361 PyObject_Call + 193
35            0x513e8b _PyEval_EvalFrameDefault + 9211
36            0x5106ed python3() [0x5106ed]
37            0x528d21 _PyFunction_Vectorcall + 865
38            0x53c361 PyObject_Call + 193
39            0x513e8b _PyEval_EvalFrameDefault + 9211
40            0x510fe7 python3() [0x510fe7]
41            0x528d21 _PyFunction_Vectorcall + 865
42            0x53c361 PyObject_Call + 193
43            0x513e8b _PyEval_EvalFrameDefault + 9211
44            0x528b63 _PyFunction_Vectorcall + 419
45            0x51af55 python3() [0x51af55]
46            0x519fc7 _PyEval_EvalFrameDefault + 34103
47            0x5106ed python3() [0x5106ed]
48            0x510497 _PyEval_EvalCodeWithName + 71
49            0x5f5be3 PyEval_EvalCode + 35
50            0x5fa670 python3() [0x5fa670]
51            0x5298c4 python3() [0x5298c4]
52            0x51af55 python3() [0x51af55]
53            0x519fc7 _PyEval_EvalFrameDefault + 34103
54            0x5106ed python3() [0x5106ed]
55            0x528d21 _PyFunction_Vectorcall + 865
56            0x51af55 python3() [0x51af55]
57            0x519fc7 _PyEval_EvalFrameDefault + 34103
58            0x5106ed python3() [0x5106ed]
59            0x528d21 _PyFunction_Vectorcall + 865
60            0x51af55 python3() [0x51af55]
61            0x518c31 _PyEval_EvalFrameDefault + 29089
62            0x5106ed python3() [0x5106ed]
63            0x528d21 _PyFunction_Vectorcall + 865
64            0x51af55 python3() [0x51af55]
65            0x518c31 _PyEval_EvalFrameDefault + 29089
66            0x528b63 _PyFunction_Vectorcall + 419
67            0x511fb5 _PyEval_EvalFrameDefault + 1317
68            0x528b63 _PyFunction_Vectorcall + 419
69            0x516e76 _PyEval_EvalFrameDefault + 21478
70            0x5106ed python3() [0x5106ed]
71            0x510497 _PyEval_EvalCodeWithName + 71
72            0x5f5be3 PyEval_EvalCode + 35
73            0x5fa670 python3() [0x5fa670]
74            0x5298c4 python3() [0x5298c4]
75            0x511fb5 _PyEval_EvalFrameDefault + 1317
76            0x5106ed python3() [0x5106ed]
77            0x528d21 _PyFunction_Vectorcall + 865
78            0x511fb5 _PyEval_EvalFrameDefault + 1317
79            0x5106ed python3() [0x5106ed]
80            0x528d21 _PyFunction_Vectorcall + 865
81            0x60eea0 python3() [0x60eea0]
82            0x60d35d Py_RunMain + 301
83            0x5ea6e9 Py_BytesMain + 41
84      0x7f40ca81fd0a __libc_start_main + 234
85            0x5ea5ea _start + 42
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] *** Process received signal ***
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] Signal: Aborted (6)
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] Signal code:  (-6)
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7f40cab80140]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x141)[0x7f40ca834ce1]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x123)[0x7f40ca81e537]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9a7ec)[0x7f40c86217ec]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5966)[0x7f40c862c966]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa4a49)[0x7f40c862ba49]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x271)[0x7f40c862c381]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x1073f)[0x7f40c930173f]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x126)[0x7f40c93020e6]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [ 9] /usr/local/TensorRT-LLM/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x567a4)[0x7f3ea222e7a4]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [10] /usr/local/TensorRT-LLM/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins18GPTAttentionPlugin25supportsFormatCombinationEiPKN8nvinfer116PluginTensorDescEii+0x2bc)[0x7f3ea22a60ac]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [11] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xafcba4)[0x7f3f54560ba4]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [12] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xd87013)[0x7f3f547eb013]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [13] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xb13b8d)[0x7f3f54577b8d]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [14] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xdecb0e)[0x7f3f54850b0e]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [15] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xc9603f)[0x7f3f546fa03f]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [16] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xc9dedc)[0x7f3f54701edc]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [17] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xdd9ad9)[0x7f3f5483dad9]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [18] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde107c)[0x7f3f5484507c]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [19] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde3071)[0x7f3f54847071]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [20] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2861c)[0x7f3f5448c61c]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [21] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2d837)[0x7f3f54491837]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [22] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2e1af)[0x7f3f544921af]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [23] /usr/local/lib/python3.9/dist-packages/tensorrt/tensorrt.so(+0xa6558)[0x7f3f623d6558]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [24] /usr/local/lib/python3.9/dist-packages/tensorrt/tensorrt.so(+0x45833)[0x7f3f62375833]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [25] python3[0x53f350]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [26] python3(_PyObject_MakeTpCall+0x39b)[0x51d89b]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [27] python3[0x53bf25]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [28] python3[0x51af55]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] [29] python3(_PyEval_EvalFrameDefault+0x7ccd)[0x51975d]
[mlxlabhi6c6whu6646ab5f-20240517005703-edgza0-w1gue4-worker:99363] *** End of error message ***

Expected behavior

The engine builds successfully.

Actual behavior

The build aborts with the assertion failure shown above.

Additional notes

None.

Hukongtao commented 5 months ago

Can TRT-LLM support use_cache=False, like transformers' model.generate(use_cache=False)?
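
For reference, this is the transformers behaviour being asked about: generating with the KV cache disabled, which recomputes attention over the whole sequence at each step and trades speed for memory. The model name below is a small placeholder for illustration, not the Qwen-32B GPTQ checkpoint from this issue.

# Illustration of the Hugging Face transformers API mentioned above.
# The model name is a placeholder; substitute the actual checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen1.5-0.5B-Chat"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

inputs = tok("Hello", return_tensors="pt").to(model.device)
# use_cache=False disables the KV cache, so no past key/value tensors are
# kept in memory; with max_new_tokens=1 the cache would not be reused anyway.
out = model.generate(**inputs, max_new_tokens=1, use_cache=False)
print(tok.decode(out[0], skip_special_tokens=True))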

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.