NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Assertion failed: Can't free tmp workspace for GEMM tactics profiling. #1841

Open naphatkps opened 3 months ago

naphatkps commented 3 months ago

### System Info

### Who can help?

@Tracin @byshiue

### Information

### Tasks

### Reproduction

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model_name_or_path = "SeaLLMs/SeaLLM-7B-v2"
adaptor_model_path = "./model_weights/lora_adaptor/ft_model"
new_model_path = "./model_weights/seallm_with_lora"

# Merge the LoRA adaptor into the base model and save the merged weights.
model_for_merge = AutoModelForCausalLM.from_pretrained(
    base_model_name_or_path,
    torch_dtype=torch.float16,
)
full_model = PeftModel.from_pretrained(
    model_for_merge,
    model_id=adaptor_model_path,
)
full_model = full_model.base_model.merge_and_unload()
full_model.save_pretrained(new_model_path)

# Save the tokenizer alongside the merged model.
tokenizer = AutoTokenizer.from_pretrained(
    base_model_name_or_path,
    padding_side='left',
    add_eos_token=True,
    add_bos_token=True,
    trust_remote_code=True,
)
tokenizer.save_pretrained(new_model_path)
```

- `cd examples/llama`
- convert the model:

```bash
python convert_checkpoint.py --model_dir ../../model_weights/seallm_with_lora_pk \
    --output_dir ./tmp/seallm/7B/checkpoint_merged_LoRA/int8/1-gpu/ \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8
```

- build the engine:

```bash
trtllm-build --checkpoint_dir ./tmp/seallm/7B/checkpoint_merged_LoRA/int8/1-gpu/ \
    --output_dir ./tmp/seallm/7B/engine_merged_LoRA/int8/1-gpu/ \
    --max_beam_width 3 \
    --gemm_plugin float16 \
    --context_fmha enable \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --multi_block_mode enable \
    --use_paged_context_fmha enable
```


### Expected behavior

The engine is built successfully.

### Actual behavior

This is part of the error output:

```
.... ThreadblockSwizzle_ = cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>; KernelArch = cutlass::arch::Sm89; __nv_bool SplitKSerial = true] not implemented
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=2, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=4, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=8, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=16, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=32, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=64, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=128, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=256, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=512, n=6144, k=4096). Will try to use default or fail at runtime
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1024, n=6144, k=4096). Will try to use default or fail at runtime
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: Can't free tmp workspace for GEMM tactics profiling. (/src/tensorrt_llm/cpp/tensorrt_llm/plugins/common/gemmPluginProfiler.cpp:181)
1 0x7f74ca703242 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x57242) [0x7f74ca703242]
2 0x7f74ca7c1198 tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, std::shared_ptr<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunnerInterface>, tensorrt_llm::plugins::GemmIdCore, tensorrt_llm::plugins::GemmIdCoreHash>::freeTmpData() + 104
3 0x7f74ca7ca9ca tensorrt_llm::plugins::GemmPluginProfiler<tensorrt_llm::cutlass_extensions::CutlassGemmConfig, std::shared_ptr<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunnerInterface>, tensorrt_llm::plugins::GemmIdCore, tensorrt_llm::plugins::GemmIdCoreHash>::profileTactics(std::shared_ptr<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunnerInterface> const&, nvinfer1::DataType const&, tensorrt_llm::plugins::GemmDims const&, tensorrt_llm::plugins::GemmIdCore const&) + 1066
4 0x7f74ca7a0cbd tensorrt_llm::plugins::WeightOnlyQuantMatmulPlugin::initialize() + 13
5 0x7f7608599a25 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1065a25) [0x7f7608599a25]
6 0x7f76085260aa /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff20aa) [0x7f76085260aa]
7 0x7f7608312fcf /usr/local/tensorrt/lib/libnvinfer.so.10(+0xddefcf) [0x7f7608312fcf]
8 0x7f760831507c /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde107c) [0x7f760831507c]
9 0x7f7608317071 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde3071) [0x7f7608317071]
10 0x7f7607f5c61c /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2861c) [0x7f7607f5c61c]
11 0x7f7607f61837 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2d837) [0x7f7607f61837]
12 0x7f7607f621af /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2e1af) [0x7f7607f621af]
13 0x7f7615ea6478 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0xa6478) [0x7f7615ea6478]
14 0x7f7615e457a3 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x457a3) [0x7f7615e457a3]
15 0x55afc1b9810e /usr/bin/python(+0x15a10e) [0x55afc1b9810e]
16 0x55afc1b8ea7b _PyObject_MakeTpCall + 603
17 0x55afc1ba6acb /usr/bin/python(+0x168acb) [0x55afc1ba6acb]
18 0x55afc1b86cfa _PyEval_EvalFrameDefault + 24906
19 0x55afc1b989fc _PyFunction_Vectorcall + 124
20 0x55afc1b835d7 _PyEval_EvalFrameDefault + 10791
21 0x55afc1b989fc _PyFunction_Vectorcall + 124
22 0x55afc1b8145c _PyEval_EvalFrameDefault + 2220
23 0x55afc1b989fc _PyFunction_Vectorcall + 124
24 0x55afc1b8126d _PyEval_EvalFrameDefault + 1725
25 0x55afc1b989fc _PyFunction_Vectorcall + 124
26 0x55afc1ba7492 PyObject_Call + 290
27 0x55afc1b835d7 _PyEval_EvalFrameDefault + 10791
28 0x55afc1b989fc _PyFunction_Vectorcall + 124
29 0x55afc1ba7492 PyObject_Call + 290
30 0x55afc1b835d7 _PyEval_EvalFrameDefault + 10791
31 0x55afc1b989fc _PyFunction_Vectorcall + 124
32 0x55afc1ba7492 PyObject_Call + 290
33 0x55afc1b835d7 _PyEval_EvalFrameDefault + 10791
34 0x55afc1b989fc _PyFunction_Vectorcall + 124
35 0x55afc1b8126d _PyEval_EvalFrameDefault + 1725
36 0x55afc1b7d9c6 /usr/bin/python(+0x13f9c6) [0x55afc1b7d9c6]
37 0x55afc1c73256 PyEval_EvalCode + 134
38 0x55afc1c9e108 /usr/bin/python(+0x260108) [0x55afc1c9e108]
39 0x55afc1c979cb /usr/bin/python(+0x2599cb) [0x55afc1c979cb]
40 0x55afc1c9de55 /usr/bin/python(+0x25fe55) [0x55afc1c9de55]
41 0x55afc1c9d338 _PyRun_SimpleFileObject + 424
42 0x55afc1c9cf83 _PyRun_AnyFileObject + 67
43 0x55afc1c8fa5e Py_RunMain + 702
44 0x55afc1c6602d Py_BytesMain + 45
45 0x7f767409ed90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f767409ed90]
46 0x7f767409ee40 __libc_start_main + 128
47 0x55afc1c65f25 _start + 37
[cu-set-release:61857] Process received signal
[cu-set-release:61857] Signal: Aborted (6)
[cu-set-release:61857] Signal code: (-6)
[cu-set-release:61857] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f76740b7520]
[cu-set-release:61857] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f767410b9fc]
[cu-set-release:61857] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f76740b7476]
[cu-set-release:61857] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f767409d7f3]
[cu-set-release:61857] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f7604c0db9e]
[cu-set-release:61857] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f7604c1920c]
[cu-set-release:61857] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f7604c181e9]
[cu-set-release:61857] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f7604c18959]
[cu-set-release:61857] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f7673c88884]
[cu-set-release:61857] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f7673c892dd]
[cu-set-release:61857] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins18GemmPluginProfilerINS_18cutlass_extensions17CutlassGemmConfigESt10shared_ptrINS_7kernels15cutlass_kernels33CutlassFpAIntBGemmRunnerInterfaceEENS0_10GemmIdCoreENS0_14GemmIdCoreHashEE14profileTacticsERKS8_RKN8nvinfer18DataTypeERKNS0_8GemmDimsERKS9_+0x75c)[0x7f74ca7cacfc]
[cu-set-release:61857] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins27WeightOnlyQuantMatmulPlugin10initializeEv+0xd)[0x7f74ca7a0cbd]
[cu-set-release:61857] [12] /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1065a25)[0x7f7608599a25]
[cu-set-release:61857] [13] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff20aa)[0x7f76085260aa]
[cu-set-release:61857] [14] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xddefcf)[0x7f7608312fcf]
[cu-set-release:61857] [15] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde107c)[0x7f760831507c]
[cu-set-release:61857] [16] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xde3071)[0x7f7608317071]
[cu-set-release:61857] [17] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2861c)[0x7f7607f5c61c]
[cu-set-release:61857] [18] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2d837)[0x7f7607f61837]
[cu-set-release:61857] [19] /usr/local/tensorrt/lib/libnvinfer.so.10(+0xa2e1af)[0x7f7607f621af]
[cu-set-release:61857] [20] /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0xa6478)[0x7f7615ea6478]
[cu-set-release:61857] [21] /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x457a3)[0x7f7615e457a3]
[cu-set-release:61857] [22] /usr/bin/python(+0x15a10e)[0x55afc1b9810e]
[cu-set-release:61857] [23] /usr/bin/python(_PyObject_MakeTpCall+0x25b)[0x55afc1b8ea7b]
[cu-set-release:61857] [24] /usr/bin/python(+0x168acb)[0x55afc1ba6acb]
[cu-set-release:61857] [25] /usr/bin/python(_PyEval_EvalFrameDefault+0x614a)[0x55afc1b86cfa]
[cu-set-release:61857] [26] /usr/bin/python(_PyFunction_Vectorcall+0x7c)[0x55afc1b989fc]
[cu-set-release:61857] [27] /usr/bin/python(_PyEval_EvalFrameDefault+0x2a27)[0x55afc1b835d7]
[cu-set-release:61857] [28] /usr/bin/python(_PyFunction_Vectorcall+0x7c)[0x55afc1b989fc]
[cu-set-release:61857] [29] /usr/bin/python(_PyEval_EvalFrameDefault+0x8ac)[0x55afc1b8145c]
[cu-set-release:61857] End of error message
Aborted (core dumped)
```



### Additional notes

- Whenever I try to quantize with `int4_awq`, `int8_wo`, `int4_wo`, or any other setting of the `--weight_only_precision` flag, I get the same error during engine building (an int4 variant of the conversion command is sketched below).
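
For reference, a sketch of the int4 weight-only variant of the conversion step; it differs from the int8 command above only in the precision flag and the output directory (the `int4` value here is assumed to correspond to the `int4_wo` setting):

```bash
# Same conversion as above, but with int4 weight-only quantization
# (output directory changed to keep the int8 checkpoint intact).
python convert_checkpoint.py --model_dir ../../model_weights/seallm_with_lora_pk \
    --output_dir ./tmp/seallm/7B/checkpoint_merged_LoRA/int4/1-gpu/ \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4
```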
nv-guomingz commented 3 months ago

@nekorobov would you please take a look at it?

nekorobov commented 2 months ago

@Naphat-Khoprasertthaworn thank you for reporting the problem. This failure happens due to a lack of memory for the GEMM plugin profiler. You can try reducing the number of tokens and the batch size, e.g. set `--max_num_tokens 1024` and/or `--max_batch_size 32`. Let me know if it helps.
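
For example, a sketch of the build command from the reproduction steps with the suggested limits added (all other flags unchanged):

```bash
# Original trtllm-build invocation, plus reduced profiling limits
# (--max_num_tokens / --max_batch_size) to shrink the profiler workspace.
trtllm-build --checkpoint_dir ./tmp/seallm/7B/checkpoint_merged_LoRA/int8/1-gpu/ \
    --output_dir ./tmp/seallm/7B/engine_merged_LoRA/int8/1-gpu/ \
    --max_num_tokens 1024 \
    --max_batch_size 32 \
    --max_beam_width 3 \
    --gemm_plugin float16 \
    --context_fmha enable \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --multi_block_mode enable \
    --use_paged_context_fmha enable
```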

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.