NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Llama 3 70B FP8 engine build failed with FMHA #2125

Open · ayush1399 opened this issue 2 months ago

ayush1399 commented 2 months ago

System Info

AWS p5 (4 x 80GB H100 GPUs), TensorRT-LLM v0.11.0

Who can help?

@byshiue @Tracin

Reproduction

python ./quantize.py --model_dir ./Meta-Llama-3-70B-Instruct --dtype bfloat16 --output_dir ./Meta-Llama-3-70B-Instruct_fp8 --calib_size 1024 --calib_dataset /home/triton-server/calibration --tp_size 4 --qformat fp8

trtllm-build --checkpoint_dir ./Meta-Llama-3-70B-Instruct_fp8 --output_dir ./Meta-Llama-3-70B-Instruct_fp8_engine_fmha --gemm_plugin auto --workers 1 --use_paged_context_fmha enable --use_fp8_context_fmha enable --max_batch_size 16
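
Before building, the quantized checkpoint can be sanity-checked by inspecting its config.json (a minimal sketch, assuming quantize.py emitted a config.json with a "quantization" section; key names may vary across TensorRT-LLM versions):

python3 -c "import json; print(json.dumps(json.load(open('./Meta-Llama-3-70B-Instruct_fp8/config.json')).get('quantization'), indent=2))"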

Expected behavior

Engine created successfully.

Actual behavior

Engine build fails with

[08/18/2024-23:36:42] [TRT] [W] Detected layernorm nodes in FP16.
[08/18/2024-23:36:42] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[08/18/2024-23:36:42] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[08/18/2024-23:36:42] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: getIdx() should not be used with entry 16
 (/workspace/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionPlugin/gptAttentionPlugin.cpp:127)
1       0x7fc56bf865ce /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x7f5ce) [0x7fc56bf865ce]
2       0x7fc56bf86df0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x7fdf0) [0x7fc56bf86df0]
3       0x7fc56c020d2e tensorrt_llm::plugins::GPTAttentionPlugin::supportsFormatCombination(int, nvinfer1::PluginTensorDesc const*, int, int) + 1118
4       0x7fc986625a14 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xafda14) [0x7fc986625a14]
5       0x7fc9868d3d33 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xdabd33) [0x7fc9868d3d33]
6       0x7fc98665ef2d /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xb36f2d) [0x7fc98665ef2d]
7       0x7fc986939abe /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe11abe) [0x7fc986939abe]
8       0x7fc9867e116f /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xcb916f) [0x7fc9867e116f]
9       0x7fc9867e9e0c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xcc1e0c) [0x7fc9867e9e0c]
10      0x7fc986926c19 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xdfec19) [0x7fc986926c19]
11      0x7fc98692e21c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe0621c) [0x7fc98692e21c]
12      0x7fc986930328 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xe08328) [0x7fc986930328]
13      0x7fc98657f2ac /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa572ac) [0x7fc98657f2ac]
14      0x7fc986584501 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5c501) [0x7fc986584501]
15      0x7fc986584f0b /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0xa5cf0b) [0x7fc986584f0b]
16      0x7fc92bca7458 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa7458) [0x7fc92bca7458]
17      0x7fc92bc458f3 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3) [0x7fc92bc458f3]
18      0x5578df496c9e /usr/bin/python3(+0x15ac9e) [0x5578df496c9e]
19      0x5578df48d3cb _PyObject_MakeTpCall + 603
20      0x5578df4a53eb /usr/bin/python3(+0x1693eb) [0x5578df4a53eb]
21      0x5578df48559a _PyEval_EvalFrameDefault + 25674
22      0x5578df49759c _PyFunction_Vectorcall + 124
23      0x5578df481a9d _PyEval_EvalFrameDefault + 10573
24      0x5578df49759c _PyFunction_Vectorcall + 124
25      0x5578df47f96e _PyEval_EvalFrameDefault + 2078
26      0x5578df49759c _PyFunction_Vectorcall + 124
27      0x5578df47f827 _PyEval_EvalFrameDefault + 1751
28      0x5578df49759c _PyFunction_Vectorcall + 124
29      0x5578df4a5db2 PyObject_Call + 290
30      0x5578df481a9d _PyEval_EvalFrameDefault + 10573
31      0x5578df49759c _PyFunction_Vectorcall + 124
32      0x5578df4a5db2 PyObject_Call + 290
33      0x5578df481a9d _PyEval_EvalFrameDefault + 10573
34      0x5578df49759c _PyFunction_Vectorcall + 124
35      0x5578df4a5db2 PyObject_Call + 290
36      0x5578df481a9d _PyEval_EvalFrameDefault + 10573
37      0x5578df49759c _PyFunction_Vectorcall + 124
38      0x5578df47f827 _PyEval_EvalFrameDefault + 1751
39      0x5578df47bf96 /usr/bin/python3(+0x13ff96) [0x5578df47bf96]
40      0x5578df571c66 PyEval_EvalCode + 134
41      0x5578df59cb38 /usr/bin/python3(+0x260b38) [0x5578df59cb38]
42      0x5578df5963fb /usr/bin/python3(+0x25a3fb) [0x5578df5963fb]
43      0x5578df59c885 /usr/bin/python3(+0x260885) [0x5578df59c885]
44      0x5578df59bd68 _PyRun_SimpleFileObject + 424
45      0x5578df59b9b3 _PyRun_AnyFileObject + 67
46      0x5578df58e45e Py_RunMain + 702
47      0x5578df564a3d Py_BytesMain + 45
48      0x7fc9ab1e4d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc9ab1e4d90]
49      0x7fc9ab1e4e40 __libc_start_main + 128
50      0x5578df564935 _start + 37

Additional notes

The engine build runs fine when I don't include --use_paged_context_fmha enable --use_fp8_context_fmha enable when running trtllm-build (working command shown below for reference).
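
This is the build command that succeeds, i.e. the one from the reproduction steps minus those two flags (output directory renamed accordingly):

trtllm-build --checkpoint_dir ./Meta-Llama-3-70B-Instruct_fp8 --output_dir ./Meta-Llama-3-70B-Instruct_fp8_engine --gemm_plugin auto --workers 1 --max_batch_size 16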

Kefeng-Duan commented 2 months ago

Hi @ayush1399, this seems to be a version mismatch issue. Could you:

  1. update to the latest commit
  2. install the latest PyPI wheel
  3. clean and rebuild TRT-LLM
  4. rebuild the engine (see the sketch below)
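
Something along these lines, run from the TensorRT-LLM source checkout (a rough sketch only; exact paths and flags depend on your setup):

git pull && git submodule update --init --recursive
pip3 install --upgrade tensorrt_llm --extra-index-url https://pypi.nvidia.com
python3 ./scripts/build_wheel.py --clean
# then rerun the trtllm-build command from the reproduction steps
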
github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.