Closed: Aoempty closed this issue 3 months ago.
Please set --context_fmha disable when building the engine, because the fused MHA kernel is not supported on Turing GPUs; it will fall back to the unfused path.
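The reason the flag is needed can be sketched as a simple SM-version gate (a minimal illustration, not the actual fmhaRunner.cpp code; the supported set {80, 86, 89, 90} is an assumption for this TRT-LLM version and may differ by release):

```python
# Hypothetical mirror of the architecture check in fmhaRunner.cpp: the fused
# context-MHA kernels target Ampere and newer parts, so Turing (sm75) is
# rejected and the engine must be built with the unfused fallback instead.
FMHA_SUPPORTED_SMS = {80, 86, 89, 90}  # assumed set for this version

def context_fmha_supported(sm_version: int) -> bool:
    """Return True if the fused context MHA kernel can run on this SM."""
    return sm_version in FMHA_SUPPORTED_SMS

if __name__ == "__main__":
    for sm in (70, 75, 80, 86, 89, 90):
        status = "fused" if context_fmha_supported(sm) else "unfused fallback"
        print(f"sm{sm}: {status}")
```

On a 2080 Ti (sm75) this gate fails, which is why the build must pass --context_fmha disable up front rather than letting the plugin hit the assertion at runtime.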
Thanks, now it converts successfully! However, when I run the following command, I get a new error:
"root@81b5288ae872:/TensorRT-LLM/examples/llama# python3 ../run.py --engine_dir llama-2-7b-engine --max_output_len 100 --tokenizer_dir meta-llama/Llama-2-7b-chat-hf --input_text "How do I count to nine in French?"
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024043000
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024043000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
Traceback (most recent call last):
File "/TensorRT-LLM/examples/llama/../run.py", line 564, in
It looks like TRT-LLM cannot load your engine successfully. Could you try rebuilding the repo and running the end-to-end workflow again?
Also, could you share the log from building the engine? There might have been an issue during the build.
Context FMHA doesn't support the GeForce RTX 2080 Ti, whose SM version is sm75:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp#L91
You may try building the engine with context FMHA disabled (--context_fmha disable), or building on other supported hardware.
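For reference, a build invocation with context FMHA disabled might look like the following (the checkpoint and output paths are placeholders, and other flags depend on your setup; see the llama example README for the exact command):

```shell
# Placeholder paths; adjust to your converted checkpoint and desired output.
trtllm-build \
    --checkpoint_dir ./llama-2-7b-ckpt \
    --output_dir ./llama-2-7b-engine \
    --gemm_plugin float16 \
    --context_fmha disable
```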
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
I am following the quick start guide; after this step, it should compile the Llama 2 model into a TensorRT engine.
Actual behavior
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
[04/28/2024-15:57:09] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set nccl_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set lookup_plugin to None.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set lora_plugin to None.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set moe_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set context_fmha to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set remove_input_padding to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set multi_block_mode to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set enable_xqa to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set multiple_profiles to False.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set paged_state to True.
[04/28/2024-15:57:09] [TRT-LLM] [I] Set streamingllm to False.
[04/28/2024-15:57:09] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/28/2024-15:57:09] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[04/28/2024-15:57:10] [TRT-LLM] [I] Compute capability: (7, 5)
[04/28/2024-15:57:10] [TRT-LLM] [I] SM count: 68
[04/28/2024-15:57:10] [TRT-LLM] [I] SM clock: 2100 MHz
[04/28/2024-15:57:10] [TRT-LLM] [I] int4 TFLOPS: 584
[04/28/2024-15:57:10] [TRT-LLM] [I] int8 TFLOPS: 292
[04/28/2024-15:57:10] [TRT-LLM] [I] fp8 TFLOPS: 0
[04/28/2024-15:57:10] [TRT-LLM] [I] float16 TFLOPS: 146
[04/28/2024-15:57:10] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[04/28/2024-15:57:10] [TRT-LLM] [I] float32 TFLOPS: 18
[04/28/2024-15:57:10] [TRT-LLM] [I] Total Memory: 11 GiB
[04/28/2024-15:57:10] [TRT-LLM] [I] Memory clock: 7000 MHz
[04/28/2024-15:57:10] [TRT-LLM] [I] Memory bus width: 352
[04/28/2024-15:57:10] [TRT-LLM] [I] Memory bandwidth: 616 GB/s
[04/28/2024-15:57:10] [TRT-LLM] [I] NVLink is active: False
[04/28/2024-15:57:10] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[04/28/2024-15:57:10] [TRT-LLM] [I] PCIe link width: 16
[04/28/2024-15:57:10] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[04/28/2024-15:57:10] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 259, GPU 157 (MiB)
[04/28/2024-15:57:12] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +978, GPU +180, now: CPU 1373, GPU 337 (MiB)
[04/28/2024-15:57:12] [TRT-LLM] [I] Set nccl_plugin to None.
[04/28/2024-15:57:12] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/28/2024-15:57:12] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/28/2024-15:57:12] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Unsupported architecture (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp:89)
1 0x7f120b494f3b tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2 0x7f120b497194 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x691194) [0x7f120b497194]
3 0x7f11deca9eef tensorrt_llm::plugins::GPTAttentionPluginCommon::initialize() + 415
4 0x7f11decd1e6d tensorrt_llm::plugins::GPTAttentionPlugin* tensorrt_llm::plugins::GPTAttentionPluginCommon::cloneImpl<tensorrt_llm::plugins::GPTAttentionPlugin>() const + 573
5 0x7f12dfbd1279 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xae3279) [0x7f12dfbd1279]
6 0x7f12dfb1f02e /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xa3102e) [0x7f12dfb1f02e]
7 0x7f128b6dfcef /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xdfcef) [0x7f128b6dfcef]
8 0x7f128b643443 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x43443) [0x7f128b643443]
9 0x5617f827010e /usr/bin/python3(+0x15a10e) [0x5617f827010e]
10 0x5617f8266a7b _PyObject_MakeTpCall + 603
11 0x5617f827eacb /usr/bin/python3(+0x168acb) [0x5617f827eacb]
12 0x5617f825ecfa _PyEval_EvalFrameDefault + 24906
13 0x5617f82709fc _PyFunction_Vectorcall + 124
14 0x5617f827f492 PyObject_Call + 290
15 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
16 0x5617f82709fc _PyFunction_Vectorcall + 124
17 0x5617f827f492 PyObject_Call + 290
18 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
19 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
20 0x5617f827f492 PyObject_Call + 290
21 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
22 0x5617f82709fc _PyFunction_Vectorcall + 124
23 0x5617f8265cbd _PyObject_FastCallDictTstate + 365
24 0x5617f827b86c _PyObject_Call_Prepend + 92
25 0x5617f8396700 /usr/bin/python3(+0x280700) [0x5617f8396700]
26 0x5617f8266a7b _PyObject_MakeTpCall + 603
27 0x5617f8260150 _PyEval_EvalFrameDefault + 30112
28 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
29 0x5617f827f492 PyObject_Call + 290
30 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
31 0x5617f82709fc _PyFunction_Vectorcall + 124
32 0x5617f8265cbd _PyObject_FastCallDictTstate + 365
33 0x5617f827b86c _PyObject_Call_Prepend + 92
34 0x5617f8396700 /usr/bin/python3(+0x280700) [0x5617f8396700]
35 0x5617f827f42b PyObject_Call + 187
36 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
37 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
38 0x5617f825a53c _PyEval_EvalFrameDefault + 6540
39 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
40 0x5617f827f492 PyObject_Call + 290
41 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
42 0x5617f827e7f1 /usr/bin/python3(+0x1687f1) [0x5617f827e7f1]
43 0x5617f827f492 PyObject_Call + 290
44 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
45 0x5617f82709fc _PyFunction_Vectorcall + 124
46 0x5617f8265cbd _PyObject_FastCallDictTstate + 365
47 0x5617f827b86c _PyObject_Call_Prepend + 92
48 0x5617f8396700 /usr/bin/python3(+0x280700) [0x5617f8396700]
49 0x5617f827f42b PyObject_Call + 187
50 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
51 0x5617f82709fc _PyFunction_Vectorcall + 124
52 0x5617f825926d _PyEval_EvalFrameDefault + 1725
53 0x5617f82709fc _PyFunction_Vectorcall + 124
54 0x5617f827f492 PyObject_Call + 290
55 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
56 0x5617f82709fc _PyFunction_Vectorcall + 124
57 0x5617f827f492 PyObject_Call + 290
58 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
59 0x5617f82709fc _PyFunction_Vectorcall + 124
60 0x5617f827f492 PyObject_Call + 290
61 0x5617f825b5d7 _PyEval_EvalFrameDefault + 10791
62 0x5617f82709fc _PyFunction_Vectorcall + 124
63 0x5617f825926d _PyEval_EvalFrameDefault + 1725
64 0x5617f82559c6 /usr/bin/python3(+0x13f9c6) [0x5617f82559c6]
65 0x5617f834b256 PyEval_EvalCode + 134
66 0x5617f8376108 /usr/bin/python3(+0x260108) [0x5617f8376108]
67 0x5617f836f9cb /usr/bin/python3(+0x2599cb) [0x5617f836f9cb]
68 0x5617f8375e55 /usr/bin/python3(+0x25fe55) [0x5617f8375e55]
69 0x5617f8375338 _PyRun_SimpleFileObject + 424
70 0x5617f8374f83 _PyRun_AnyFileObject + 67
71 0x5617f8367a5e Py_RunMain + 702
72 0x5617f833e02d Py_BytesMain + 45
73 0x7f148ebe2d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f148ebe2d90]
74 0x7f148ebe2e40 __libc_start_main + 128
75 0x5617f833df25 _start + 37
[e62a70965c65:02001] *** Process received signal ***
[e62a70965c65:02001] Signal: Aborted (6)
[e62a70965c65:02001] Signal code: (-6)
[e62a70965c65:02001] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f148ebfb520]
[e62a70965c65:02001] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f148ec4f9fc]
[e62a70965c65:02001] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f148ebfb476]
[e62a70965c65:02001] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f148ebe17f3]
[e62a70965c65:02001] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f13ec476b9e]
[e62a70965c65:02001] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f13ec48220c]
[e62a70965c65:02001] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f13ec4811e9]
[e62a70965c65:02001] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f13ec481959]
[e62a70965c65:02001] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f148e8eb884]
[e62a70965c65:02001] [ 9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f148e8ec2dd]
[e62a70965c65:02001] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x691118)[0x7f120b497118]
[e62a70965c65:02001] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins24GPTAttentionPluginCommon10initializeEv+0x19f)[0x7f11deca9eef]
[e62a70965c65:02001] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZNK12tensorrt_llm7plugins24GPTAttentionPluginCommon9cloneImplINS0_18GPTAttentionPluginEEEPT_v+0x23d)[0x7f11decd1e6d]
[e62a70965c65:02001] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xae3279)[0x7f12dfbd1279]
[e62a70965c65:02001] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.9(+0xa3102e)[0x7f12dfb1f02e]
[e62a70965c65:02001] [15] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xdfcef)[0x7f128b6dfcef]
[e62a70965c65:02001] [16] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x43443)[0x7f128b643443]
[e62a70965c65:02001] [17] /usr/bin/python3(+0x15a10e)[0x5617f827010e]
[e62a70965c65:02001] [18] /usr/bin/python3(_PyObject_MakeTpCall+0x25b)[0x5617f8266a7b]
[e62a70965c65:02001] [19] /usr/bin/python3(+0x168acb)[0x5617f827eacb]
[e62a70965c65:02001] [20] /usr/bin/python3(_PyEval_EvalFrameDefault+0x614a)[0x5617f825ecfa]
[e62a70965c65:02001] [21] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x5617f82709fc]
[e62a70965c65:02001] [22] /usr/bin/python3(PyObject_Call+0x122)[0x5617f827f492]
[e62a70965c65:02001] [23] /usr/bin/python3(_PyEval_EvalFrameDefault+0x2a27)[0x5617f825b5d7]
[e62a70965c65:02001] [24] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x5617f82709fc]
[e62a70965c65:02001] [25] /usr/bin/python3(PyObject_Call+0x122)[0x5617f827f492]
[e62a70965c65:02001] [26] /usr/bin/python3(_PyEval_EvalFrameDefault+0x2a27)[0x5617f825b5d7]
[e62a70965c65:02001] [27] /usr/bin/python3(+0x1687f1)[0x5617f827e7f1]
[e62a70965c65:02001] [28] /usr/bin/python3(PyObject_Call+0x122)[0x5617f827f492]
[e62a70965c65:02001] [29] /usr/bin/python3(_PyEval_EvalFrameDefault+0x2a27)[0x5617f825b5d7]
[e62a70965c65:02001] *** End of error message ***
Aborted (core dumped)
Additional notes
None