NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Sequence Length is too long for the batchApplyRepetitionPenalty kernel (not enough shared memory). (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/samplingPenaltyKernels.cu:277) #1251

Open BasicCoder opened 8 months ago

BasicCoder commented 8 months ago

System Info

CPU: x86_64
GPU: 4x A100 80GB
TensorRT-LLM: 0.6.1

Who can help?

@kaiyux @byshiue

Reproduction

Using TRT-LLM v0.6.1.

1. Convert the model:
python ../examples/llama/build.py \
                --world_size 4 \
                --tp_size 4 \
                --model_dir /workspace/llama-70b/ \
                --dtype float16 \
                --vocab_size 39424 \
                --max_batch_size 4 \
                --max_input_len 98000 \
                --max_output_len 4096 \
                --max_beam_width 1 \
                --rotary_base 500000 \
                --rotary_scaling dynamic 4.0 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --enable_context_fmha \
                --multi_block_mode \
                --use_parallel_embedding \
                --embedding_sharding_dim 0 \
                --use_inflight_batching \
                --paged_kv_cache \
                --remove_input_padding \
                --output_dir /workspace/llama-70b/trt_engines/fp16/4-gpu/ > llama_70b_convert_tp4.log 2>&1
2. Deploy this model using Triton.
3. Set the request parameter repetition_penalty=1.1 and send a request to the Triton server (see the sketch after this list).
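
For concreteness, a request for step 3 might look like the sketch below. This is a hypothetical example, not taken from the issue: the model name (`ensemble`), the generate endpoint, and the field names follow the common tensorrtllm_backend ensemble setup and may differ in your deployment.

```python
# Hypothetical client for step 3: POST to Triton's generate endpoint with
# repetition_penalty set. Adjust model name / field names to your deployment.
import requests

payload = {
    "text_input": "a very long prompt, on the order of 100k tokens ...",  # placeholder
    "max_tokens": 128,
    "repetition_penalty": 1.1,  # 1.0 does not trigger the error; 1.1 does
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=600,
)
print(resp.status_code, resp.text)
```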

Expected behavior

Get the correct return result.

Actual behavior

[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Sequence Length is too long for the batchApplyRepetitionPenalty kernel (not enough shared memory). (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/samplingPenaltyKernels.cu:277)
1       0x7f7f73b342b6 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x162b6) [0x7f7f73b342b6]
2       0x7f7f73cc1c75 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1a3c75) [0x7f7f73cc1c75]
3       0x7f7f73cb8893 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x19a893) [0x7f7f73cb8893]
4       0x7f7f73c59d9f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x13bd9f) [0x7f7f73c59d9f]
5       0x7f7f73c3fe67 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x121e67) [0x7f7f73c3fe67]
6       0x7f7f73bd3dc6 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xb5dc6) [0x7f7f73bd3dc6]
7       0x7f7f73b90860 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72860) [0x7f7f73b90860]
8       0x7f7f73b92521 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x74521) [0x7f7f73b92521]
9       0x7f7f73b7ff44 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x61f44) [0x7f7f73b7ff44]
10      0x7f7f73b82aef /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64aef) [0x7f7f73b82aef]
11      0x7f804a464253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f804a464253]
12      0x7f804a1f4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f804a1f4ac3]
13      0x7f804a285bf4 clone + 68

additional notes

I checked the code, and the same code is present in versions 0.6.1 through 0.7.1: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.6.1/cpp/tensorrt_llm/kernels/samplingPenaltyKernels.cu#L271 .

This error only occurs with repetition_penalty=1.1; with repetition_penalty=1.0 there is no error. This is likely because the input the model needs to process is about 100k tokens long, which exceeds smemSize. How can I work around this length limitation?
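
For intuition, a back-of-the-envelope check of why ~100k exceeds the limit. The assumption here (from reading the v0.6.1 kernel) is that it stages roughly one float logit plus one int token id per sequence position in dynamic shared memory; the exact layout may differ.

```python
# Rough estimate of the kernel's shared-memory demand, assuming it stages one
# float32 logit plus one int32 token id per sequence position.
SEQ_LEN = 100_000                    # input length in this issue
BYTES_PER_POSITION = 4 + 4           # float32 logit + int32 token id

smem_needed = SEQ_LEN * BYTES_PER_POSITION   # 800,000 bytes
A100_SMEM_LIMIT = 164 * 1024                 # ~164 KiB max per block (opt-in)

print(f"needed {smem_needed / 1024:.0f} KiB, limit {A100_SMEM_LIMIT / 1024:.0f} KiB")
# needed 781 KiB, limit 164 KiB -> the assertion fires
```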

byshiue commented 8 months ago

Could you try the latest main branch? The issue is fixed there by avoiding the use of shared memory in the penalty kernel.
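
(For intuition: the penalty can be applied without a sequence-length-sized staging buffer by updating the logits of previously seen tokens in place. A NumPy sketch of that algorithm follows; it is illustrative only, not the actual CUDA patch.)

```python
# NumPy sketch of repetition penalty applied in place: only the logits of
# previously generated tokens are touched, so no sequence-length-sized
# scratch buffer is needed. Illustrative only; not the actual kernel code.
import numpy as np

def apply_repetition_penalty(logits, output_ids, penalty=1.1):
    for token_id in np.unique(output_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty   # dampen positive logits
        else:
            logits[token_id] *= penalty   # push negative logits further down
```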

BasicCoder commented 8 months ago

> Could you try the latest main branch? The issue is fixed there by avoiding the use of shared memory in the penalty kernel.

Thanks for your help. Has this issue also been fixed in TRT-LLM v0.8.0?

byshiue commented 8 months ago

Yes, the issue is also fixed in v0.8.0.