NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Model goes into unusable state when passing specific LoRA as input to trtllm model with LoRA support #1781

Open pankajroark opened 3 months ago

pankajroark commented 3 months ago

### System Info

x86_64, NVIDIA A100 80GB, TensorRT-LLM v0.10.0

### Who can help?

@ncomly-nvidia


### Reproduction

1. Build the model using the steps below:

       BASE_LLAMA_MODEL=mistral
       huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 --local-dir $BASE_LLAMA_MODEL

       python convert_checkpoint.py --model_dir ${BASE_LLAMA_MODEL} \
                                    --output_dir ./tllm_checkpoint_1gpu \
                                    --dtype float16

       trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
                    --output_dir ./engine/mistral/i1600-o600-bs96-tp1-fp16-lora \
                    --gemm_plugin float16 \
                    --max_batch_size 96 \
                    --max_input_len 1600 \
                    --max_output_len 600 \
                    --gpt_attention_plugin float16 \
                    --paged_kv_cache enable \
                    --remove_input_padding enable \
                    --use_paged_context_fmha enable \
                    --use_custom_all_reduce disable \
                    --lora_plugin float16 \
                    --lora_target_modules attn_q attn_k attn_v attn_dense \
                    --max_lora_rank 16


2. Invoke the model with the following to make sure the model works (a sketch of these commands is shown just after this list):
   - prompt: "once upon a time"
   - lora_hf_repo: kunishou/Japanese-Alpaca-LoRA-7b-v0
3. After making sure the above works well, try this LoRA instead:
   - prompt: "once upon a time"
   - lora_hf_repo: Tsukitsune/alpaca_7b_lora
   This request fails.
4. Retry the request that was working before. Even this request fails now. Basically,
   all requests start failing after this point.
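
For concreteness, here is a minimal sketch of what the step 2 invocation might look like, mirroring the `hf_lora_convert.py` / `run.py` commands from the follow-up comment further down in this thread; the local directory names and the `--input_text` usage are illustrative assumptions, not copied from the original report.

```sh
# Sketch of step 2 (the LoRA that works); directory names are placeholders.
huggingface-cli download kunishou/Japanese-Alpaca-LoRA-7b-v0 --local-dir lora_ja
python3 ../hf_lora_convert.py -i lora_ja -o japanese-alpaca-lora-weights --storage-type float16

python3 ../run.py --max_output_len=50 \
                  --tokenizer_dir ./mistral/ \
                  --engine_dir=./engine/mistral/i1600-o600-bs96-tp1-fp16-lora \
                  --use_py_session \
                  --lora_task_uids=0 \
                  --lora_dir=lora_ja \
                  --input_text "once upon a time"
```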

### Expected behavior

If the supplied LoRA in the request is incompatible with the model, it is fine for that request to fail. But the model shouldn't go into a bad state afterwards.

### Actual behavior

The model goes into an unusable state, failing all requests afterwards. The bad request acts like a poison pill.

### Additional notes

Error stack trace:
```sh
"in ensemble 'ensemble', Executor failed process requestId 4 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemcpyAsync(tgt, src, sizeof(T) * size, cudaMemcpyDefault, stream): misaligned address (/app/tensorrt_llm/cpp/tensorrt_llm/common/memoryUtils.cu:211)\
1       0x7f48c7010c36 void tensorrt_llm::common::cudaAutoCpy<bool>(bool*, bool const*, unsigned long, CUstream_st*) + 214\
2       0x7f48c8b9290e tensorrt_llm::layers::TopKSamplingLayer<float>::setup(int, int, int const*, std::shared_ptr<tensorrt_llm::layers::BaseSetupParams>) + 1678\
3       0x7f48c8b7f8f2 tensorrt_llm::layers::SamplingLayer<float>::setup(int, int, int const*, std::shared_ptr<tensorrt_llm::layers::BaseSetupParams>) + 370\
4       0x7f48c8b4a6fe tensorrt_llm::layers::DecodingLayer<float>::setup(int, int, int const*, std::shared_ptr<tensorrt_llm::layers::BaseSetupParams>) + 910\
5       0x7f48c8b5d6a3 tensorrt_llm::layers::DynamicDecodeLayer<float>::setup(int, int, int const*, std::shared_ptr<tensorrt_llm::layers::BaseSetupParams>) + 275\
6       0x7f48c8be3bbb tensorrt_llm::runtime::GptDecoder<float>::setup(tensorrt_llm::runtime::SamplingConfig const&, unsigned long, int, std::optional<std::shared_ptr<tensorrt_llm::runtime::ITensor> > const&) + 827\
7       0x7f48c8bf22dc tensorrt_llm::runtime::GptDecoderBatch::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&) + 492\
8       0x7f48c8e8eb9e tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 734\
9       0x7f48c8e90694 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 2212\
10      0x7f48c8eb47e4 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 100\
11      0x7f48c8eb6a6c tensorrt_llm::executor::Executor::Impl::executionLoop() + 380\
12      0x7f49a67e1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f49a67e1253]\
13      0x7f49a6570ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f49a6570ac3]\
14      0x7f49a6602850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f49a6602850]
```
hijkzzz commented 3 months ago

Could you try `pip install tensorrt_llm==0.11.0.dev2024061100`? And could you also provide the invoking scripts? Thanks.

pankajroark commented 3 months ago

Tried with 0.11.0.dev2024061100 and the issue still persists.

Invoking scripts (from examples/llama in the TensorRT-LLM git repo):

Build engine:

```sh
BASE_LLAMA_MODEL=mistral
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 --local-dir $BASE_LLAMA_MODEL

python3 convert_checkpoint.py --model_dir ${BASE_LLAMA_MODEL} \
                            --output_dir ./tllm_checkpoint_1gpu \
                            --dtype float16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
            --output_dir ./engine/mistral/i1600-o600-bs96-tp1-fp16-lora \
            --gemm_plugin float16 \
            --max_batch_size 96 \
            --max_input_len 1600 \
            --max_output_len 600 \
            --gpt_attention_plugin float16 \
            --paged_kv_cache enable \
            --remove_input_padding enable \
            --use_paged_context_fmha enable \
            --use_custom_all_reduce disable \
            --lora_plugin float16 \
            --lora_target_modules attn_q attn_k attn_v attn_dense \
            --max_lora_rank 16
```

Build LoRA and invoke:

```sh
huggingface-cli download Tsukitsune/alpaca_7b_lora --local-dir lora
python3 ../hf_lora_convert.py -i lora -o Tsukitsune-alpaca_7b_lora-weights --storage-type float16

python3 ../run.py --max_output_len=50 \
               --tokenizer_dir ./mistral/ \
               --engine_dir=./engine/mistral/i1600-o600-bs96-tp1-fp16-lora \
               --use_py_session \
               --lora_task_uids=0 \
               --lora_dir=lora
```
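
As a side note (an assumption on my side, not confirmed above): since the engine was built with `--max_lora_rank 16` and `--lora_target_modules attn_q attn_k attn_v attn_dense`, one way to see how the two adapters differ is to dump each adapter's rank and target modules. Assuming both adapters follow the standard PEFT layout with an `adapter_config.json`, a quick inspection could look like this:

```sh
# Hypothetical sanity check: print each adapter's rank and target modules so they
# can be compared against the engine build flags (--max_lora_rank 16,
# --lora_target_modules attn_q attn_k attn_v attn_dense).
for repo in kunishou/Japanese-Alpaca-LoRA-7b-v0 Tsukitsune/alpaca_7b_lora; do
  dir=$(basename "$repo")
  huggingface-cli download "$repo" --local-dir "$dir"
  echo "== $repo =="
  python3 -c "import json; c = json.load(open('$dir/adapter_config.json')); print('r =', c.get('r'), '| target_modules =', c.get('target_modules'))"
done
```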

The error is:

```sh
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cublasLtMatmul(getCublasLtHandle(), mOperationDesc, alpha, A, mADesc, B, mBDesc, beta, C, mCDesc, C, mCDesc, (hasAlgo ? (&algo) : NULL), mCublasWorkspace, workspaceSize, mStream): CUBLAS_STATUS_NOT_SUPPORTED (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/common/cublasMMWrapper.cpp:157)
1       0x7f6ba0718329 void tensorrt_llm::common::check<cublasStatus_t>(cublasStatus_t, char const*, char const*, int) + 121
2       0x7f6ba0716a79 tensorrt_llm::common::CublasMMWrapper::Gemm(cublasOperation_t, cublasOperation_t, int, int, int, void const*, int, void const*, int, void*, int, float, float, cublasLtMatmulAlgo_t const&, bool, bool) + 281
3       0x7f6ba0717004 tensorrt_llm::common::CublasMMWrapper::Gemm(cublasOperation_t, cublasOperation_t, int, int, int, void const*, int, void const*, int, void*, int, float, float, std::optional<cublasLtMatmulHeuristicResult_t> const&) + 84
4       0x7f6b554bca1f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x11da1f) [0x7f6b554bca1f]
5       0x7f6b554bd632 tensorrt_llm::plugins::GemmPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 2450
6       0x7f6c729c7a8c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x109fa8c) [0x7f6c729c7a8c]
7       0x7f6c7296c657 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1044657) [0x7f6c7296c657]
8       0x7f6c7296e0c1 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10460c1) [0x7f6c7296e0c1]
9       0x7f6c1d6a48f0 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa48f0) [0x7f6c1d6a48f0]
10      0x7f6c1d6458f3 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3) [0x7f6c1d6458f3]
11      0x55f7e2cac10e python3(+0x15a10e) [0x55f7e2cac10e]
12      0x55f7e2ca2a7b _PyObject_MakeTpCall + 603
13      0x55f7e2cbaacb python3(+0x168acb) [0x55f7e2cbaacb]
14      0x55f7e2c9acfa _PyEval_EvalFrameDefault + 24906
15      0x55f7e2cac9fc _PyFunction_Vectorcall + 124
16      0x55f7e2c9545c _PyEval_EvalFrameDefault + 2220
17      0x55f7e2cba93e python3(+0x16893e) [0x55f7e2cba93e]
18      0x55f7e2c975d7 _PyEval_EvalFrameDefault + 10791
19      0x55f7e2cba93e python3(+0x16893e) [0x55f7e2cba93e]
20      0x55f7e2c975d7 _PyEval_EvalFrameDefault + 10791
21      0x55f7e2cac9fc _PyFunction_Vectorcall + 124
22      0x55f7e2cbb492 PyObject_Call + 290
23      0x55f7e2c975d7 _PyEval_EvalFrameDefault + 10791
24      0x55f7e2cba7f1 python3(+0x1687f1) [0x55f7e2cba7f1]
25      0x55f7e2cbb492 PyObject_Call + 290
26      0x55f7e2c975d7 _PyEval_EvalFrameDefault + 10791
27      0x55f7e2cba7f1 python3(+0x1687f1) [0x55f7e2cba7f1]
28      0x55f7e2cbb492 PyObject_Call + 290
29      0x55f7e2c975d7 _PyEval_EvalFrameDefault + 10791
30      0x55f7e2cac9fc _PyFunction_Vectorcall + 124
31      0x55f7e2c9526d _PyEval_EvalFrameDefault + 1725
32      0x55f7e2c919c6 python3(+0x13f9c6) [0x55f7e2c919c6]
33      0x55f7e2d87256 PyEval_EvalCode + 134
34      0x55f7e2db2108 python3(+0x260108) [0x55f7e2db2108]
35      0x55f7e2dab9cb python3(+0x2599cb) [0x55f7e2dab9cb]
36      0x55f7e2db1e55 python3(+0x25fe55) [0x55f7e2db1e55]
37      0x55f7e2db1338 _PyRun_SimpleFileObject + 424
38      0x55f7e2db0f83 _PyRun_AnyFileObject + 67
39      0x55f7e2da3a5e Py_RunMain + 702
40      0x55f7e2d7a02d Py_BytesMain + 45
41      0x7f6dd77c8d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6dd77c8d90]
42      0x7f6dd77c8e40 __libc_start_main + 128
43      0x55f7e2d79f25 _start + 37
[a86872ebeb62:11430] *** Process received signal ***
[a86872ebeb62:11430] Signal: Aborted (6)
[a86872ebeb62:11430] Signal code:  (-6)
[a86872ebeb62:11430] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f6dd77e1520]
[a86872ebeb62:11430] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f6dd78359fc]
[a86872ebeb62:11430] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f6dd77e1476]
[a86872ebeb62:11430] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f6dd77c77f3]
[a86872ebeb62:11430] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f6d35076b9e]
[a86872ebeb62:11430] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f6d3508220c]
[a86872ebeb62:11430] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f6d350811e9]
[a86872ebeb62:11430] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f6d35081959]
[a86872ebeb62:11430] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f6dd74d1884]
[a86872ebeb62:11430] [ 9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f6dd74d22dd]
[a86872ebeb62:11430] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7205dd)[0x7f6ba05f85dd]
[a86872ebeb62:11430] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6common15CublasMMWrapper4GemmE17cublasOperation_tS2_iiiPKviS4_iPviffRKSt8optionalI31cublasLtMatmulHeuristicResult_tE+0x54)[0x7f6ba0717004]
[a86872ebeb62:11430] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x11da1f)[0x7f6b554bca1f]
[a86872ebeb62:11430] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins10GemmPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x992)[0x7f6b554bd632]
[a86872ebeb62:11430] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x109fa8c)[0x7f6c729c7a8c]
[a86872ebeb62:11430] [15] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1044657)[0x7f6c7296c657]
[a86872ebeb62:11430] [16] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10460c1)[0x7f6c7296e0c1]
[a86872ebeb62:11430] [17] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa48f0)[0x7f6c1d6a48f0]
[a86872ebeb62:11430] [18] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3)[0x7f6c1d6458f3]
[a86872ebeb62:11430] [19] python3(+0x15a10e)[0x55f7e2cac10e]
[a86872ebeb62:11430] [20] python3(_PyObject_MakeTpCall+0x25b)[0x55f7e2ca2a7b]
[a86872ebeb62:11430] [21] python3(+0x168acb)[0x55f7e2cbaacb]
[a86872ebeb62:11430] [22] python3(_PyEval_EvalFrameDefault+0x614a)[0x55f7e2c9acfa]
[a86872ebeb62:11430] [23] python3(_PyFunction_Vectorcall+0x7c)[0x55f7e2cac9fc]
[a86872ebeb62:11430] [24] python3(_PyEval_EvalFrameDefault+0x8ac)[0x55f7e2c9545c]
[a86872ebeb62:11430] [25] python3(+0x16893e)[0x55f7e2cba93e]
[a86872ebeb62:11430] [26] python3(_PyEval_EvalFrameDefault+0x2a27)[0x55f7e2c975d7]
[a86872ebeb62:11430] [27] python3(+0x16893e)[0x55f7e2cba93e]
[a86872ebeb62:11430] [28] python3(_PyEval_EvalFrameDefault+0x2a27)[0x55f7e2c975d7]
[a86872ebeb62:11430] [29] python3(_PyFunction_Vectorcall+0x7c)[0x55f7e2cac9fc]
[a86872ebeb62:11430] *** End of error message ***
```

cc @hijkzzz

pankajroark commented 3 months ago

Please note that I've provided the requested information. The issue is still labeled as waiting for feedback.

hijkzzz commented 3 months ago

We are working on solving the issue.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.