NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
7.35k stars · 793 forks

Run LLaMa2 with LoRA on V100 failed #1862

Open cxz91493 opened 3 days ago

cxz91493 commented 3 days ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

I followed the steps here: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.10.0/examples/llama#run-llama-with-lora

  1. Download the base model and the LoRA model from HF. Base model: Llama-2-7b-hf; LoRA model: chinese-llama-2-lora-7b

  2. Run in a basic Docker image environment

    $ sudo docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.4.0-devel-ubuntu22.04
  3. Convert model

    python3 convert_checkpoint.py --model_dir Llama-2-7b-hf \
    --output_dir ./tllm_checkpoint_1gpu \
    --dtype float16 \
    --tp_size 1
  4. Build engine

    trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
    --output_dir /mnt/tensorRT_0.10.0/TensorRT-LLM/examples/llama/trt_engines/ \
    --gemm_plugin float16 \
    --lora_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 512 \
    --max_output_len 50 \
    --lora_dir chinese-llama-2-lora-7b
  5. Run the model

    • Run on the base model (`--lora_task_uids -1`) -> works!
      
      mpirun --allow-run-as-root -n 1 python3 ../run.py --engine_dir "/mnt/tensorRT_0.10.0/TensorRT-LLM/examples/llama/trt_engines/" \
      --max_output_len 50 \
      --tokenizer_dir "chinese-llama-2-lora-7b/" \
      --input_text "今天天气很好,我到公园的时候," \
      --lora_task_uids -1 \
      --no_add_special_tokens \
      --use_py_session

""" [TensorRT-LLM] TensorRT-LLM version: 0.10.0 [06/28/2024-08:17:02] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime. /usr/local/lib/python3.10/dist-packages/torch/nested/init.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:177.) return _nested.nested_tensor( Input [Text 0]: "今天天气很好,我到公园的时候," Output [Text 0 Beam 0]: "sitting beside rivers surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded surrounded" """

    • **Run on the LoRA model (`--lora_task_uids 0`) -> got an error!**
```shell
mpirun --allow-run-as-root -n 1 python3 ../run.py --engine_dir "/mnt/tensorRT_0.10.0/TensorRT-LLM/examples/llama/trt_engines/" \
    --max_output_len 50 \
    --tokenizer_dir "chinese-llama-2-lora-7b/" \
    --input_text "今天天气很好,我到公园的时候," \
    --lora_task_uids 0 \
    --no_add_special_tokens \
    --use_py_session
```

Expected behavior

Successfully get output from the LoRA model

Actual behavior

Error occurred:

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Failed to run CUTLASS Grouped GEMM kernel. (/home/jenkins/agent/workspace/LLM/release-0.10/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/kernels/groupGemm.cu:167)
1       0x7efcdfce32bf /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x572bf) [0x7efcdfce32bf]
2       0x7efcdfe39e70 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x1ade70) [0x7efcdfe39e70]
3       0x7efcdfd805a3 tensorrt_llm::plugins::LoraPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 5859
4       0x7efde89c7a8c /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x109fa8c) [0x7efde89c7a8c]
5       0x7efde896c657 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1044657) [0x7efde896c657]
6       0x7efde896e0c1 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10460c1) [0x7efde896e0c1]
7       0x7efd936a48f0 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa48f0) [0x7efd936a48f0]
8       0x7efd936458f3 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3) [0x7efd936458f3]
9       0x5649586c710e python3(+0x15a10e) [0x5649586c710e]
10      0x5649586bda7b _PyObject_MakeTpCall + 603
11      0x5649586d5acb python3(+0x168acb) [0x5649586d5acb]
12      0x5649586b5cfa _PyEval_EvalFrameDefault + 24906
13      0x5649586c79fc _PyFunction_Vectorcall + 124
14      0x5649586b045c _PyEval_EvalFrameDefault + 2220
15      0x5649586d593e python3(+0x16893e) [0x5649586d593e]
16      0x5649586b25d7 _PyEval_EvalFrameDefault + 10791
17      0x5649586d593e python3(+0x16893e) [0x5649586d593e]
18      0x5649586b25d7 _PyEval_EvalFrameDefault + 10791
19      0x5649586c79fc _PyFunction_Vectorcall + 124
20      0x5649586d6492 PyObject_Call + 290
21      0x5649586b25d7 _PyEval_EvalFrameDefault + 10791
22      0x5649586d57f1 python3(+0x1687f1) [0x5649586d57f1]
23      0x5649586d6492 PyObject_Call + 290
24      0x5649586b25d7 _PyEval_EvalFrameDefault + 10791
25      0x5649586d57f1 python3(+0x1687f1) [0x5649586d57f1]
26      0x5649586d6492 PyObject_Call + 290
27      0x5649586b25d7 _PyEval_EvalFrameDefault + 10791
28      0x5649586c79fc _PyFunction_Vectorcall + 124
29      0x5649586b026d _PyEval_EvalFrameDefault + 1725
30      0x5649586ac9c6 python3(+0x13f9c6) [0x5649586ac9c6]
31      0x5649587a2256 PyEval_EvalCode + 134
32      0x5649587cd108 python3(+0x260108) [0x5649587cd108]
33      0x5649587c69cb python3(+0x2599cb) [0x5649587c69cb]
34      0x5649587cce55 python3(+0x25fe55) [0x5649587cce55]
35      0x5649587cc338 _PyRun_SimpleFileObject + 424
36      0x5649587cbf83 _PyRun_AnyFileObject + 67
37      0x5649587bea5e Py_RunMain + 702
38      0x56495879502d Py_BytesMain + 45
39      0x7efeb0002d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7efeb0002d90]
40      0x7efeb0002e40 __libc_start_main + 128
41      0x564958794f25 _start + 37
[63fef0dfdd53:00937] *** Process received signal ***
[63fef0dfdd53:00937] Signal: Aborted (6)
[63fef0dfdd53:00937] Signal code:  (-6)
[63fef0dfdd53:00937] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7efeb001b520]
[63fef0dfdd53:00937] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7efeb006f9fc]
[63fef0dfdd53:00937] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7efeb001b476]
[63fef0dfdd53:00937] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7efeb00017f3]
[63fef0dfdd53:00937] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7efeab076b9e]
[63fef0dfdd53:00937] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7efeab08220c]
[63fef0dfdd53:00937] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7efeab0811e9]
[63fef0dfdd53:00937] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7efeab081959]
[63fef0dfdd53:00937] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7efead357884]
[63fef0dfdd53:00937] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7efead3582dd]
[63fef0dfdd53:00937] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x5ffa2)[0x7efcdfcebfa2]
[63fef0dfdd53:00937] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins10LoraPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0x16e3)[0x7efcdfd805a3]
[63fef0dfdd53:00937] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x109fa8c)[0x7efde89c7a8c]
[63fef0dfdd53:00937] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1044657)[0x7efde896c657]
[63fef0dfdd53:00937] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10460c1)[0x7efde896e0c1]
[63fef0dfdd53:00937] [15] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa48f0)[0x7efd936a48f0]
[63fef0dfdd53:00937] [16] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3)[0x7efd936458f3]
[63fef0dfdd53:00937] [17] python3(+0x15a10e)[0x5649586c710e]
[63fef0dfdd53:00937] [18] python3(_PyObject_MakeTpCall+0x25b)[0x5649586bda7b]
[63fef0dfdd53:00937] [19] python3(+0x168acb)[0x5649586d5acb]
[63fef0dfdd53:00937] [20] python3(_PyEval_EvalFrameDefault+0x614a)[0x5649586b5cfa]
[63fef0dfdd53:00937] [21] python3(_PyFunction_Vectorcall+0x7c)[0x5649586c79fc]
[63fef0dfdd53:00937] [22] python3(_PyEval_EvalFrameDefault+0x8ac)[0x5649586b045c]
[63fef0dfdd53:00937] [23] python3(+0x16893e)[0x5649586d593e]
[63fef0dfdd53:00937] [24] python3(_PyEval_EvalFrameDefault+0x2a27)[0x5649586b25d7]
[63fef0dfdd53:00937] [25] python3(+0x16893e)[0x5649586d593e]
[63fef0dfdd53:00937] [26] python3(_PyEval_EvalFrameDefault+0x2a27)[0x5649586b25d7]
[63fef0dfdd53:00937] [27] python3(_PyFunction_Vectorcall+0x7c)[0x5649586c79fc]
[63fef0dfdd53:00937] [28] python3(PyObject_Call+0x122)[0x5649586d6492]
[63fef0dfdd53:00937] [29] python3(_PyEval_EvalFrameDefault+0x2a27)[0x5649586b25d7]
[63fef0dfdd53:00937] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 63fef0dfdd53 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Additional notes

Got the tllm_checkpoint_1gpu folder after converting the model:

$ tree tllm_checkpoint_1gpu/
tllm_checkpoint_1gpu/
├── config.json
└── rank0.safetensors

Got the trt_engines folder after building the engine:

$ tree trt_engines/
trt_engines/
├── config.json
├── lora
│   └── 0
│       ├── adapter_config.json
│       └── adapter_model.bin
└── rank0.engine
byshiue commented 3 days ago

LoRA is not supported on V100. If you would like TRT-LLM to support this feature, you can create an issue to request it.
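For reference, the V100 is a Volta GPU (CUDA compute capability 7.0), while the failing CUTLASS grouped-GEMM path used by the LoRA plugin targets newer architectures; the exact minimum is an assumption here, since the maintainer only states that LoRA is unsupported on V100. A minimal sketch for checking the compute capability of the visible GPUs before building a LoRA engine, using `nvidia-smi`'s `compute_cap` query field (available on recent driver versions):

```python
# Minimal sketch: report the compute capability of visible NVIDIA GPUs.
# V100 reports "7.0" (Volta). If your GPU reports 7.x, the LoRA plugin
# in TRT-LLM 0.10.0 is expected to fail as in this issue.
import shutil
import subprocess

def gpu_compute_caps() -> list[str]:
    """Return compute capabilities (e.g. ['7.0']), or [] if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

print(gpu_compute_caps() or "no NVIDIA GPU visible")
```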

anigi98932 commented 3 days ago

I have the same error on V100.

docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.4.0-devel-ubuntu22.04