NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Sequence Length is too long for the batchApplyRepetitionPenalty kernel (not enough shared memory). (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/samplingPenaltyKernels.cu:277) #1251

Open BasicCoder opened 8 months ago

BasicCoder commented 8 months ago

System Info

CPU: x86_64
GPU: 4x A100 80GB
TensorRT-LLM: 0.6.1

Who can help?

@kaiyux @byshiue

Reproduction

Using TRT-LLM v0.6.1.

1. Convert the model:
python ../examples/llama/build.py \
                --world_size 4 \
                --tp_size 4 \
                --model_dir /workspace/llama-70b/ \
                --dtype float16 \
                --vocab_size 39424 \
                --max_batch_size 4 \
                --max_input_len 98000 \
                --max_output_len 4096 \
                --max_beam_width 1 \
                --rotary_base 500000 \
                --rotary_scaling dynamic 4.0 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --enable_context_fmha \
                --multi_block_mode \
                --use_parallel_embedding \
                --embedding_sharding_dim 0 \
                --use_inflight_batching \
                --paged_kv_cache \
                --remove_input_padding \
                --output_dir /workspace/llama-70b/trt_engines/fp16/4-gpu/ > llama_70b_convert_tp4.log 2>&1
2. Deploy this model using Triton.
3. Set the request parameter repetition_penalty=1.1 and send a request to the Triton server (see the sketch after this list).
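
For concreteness, a request for step 3 might look like the sketch below. This is a hypothetical example, not taken from the issue: the model name (`ensemble`), the generate endpoint, and the field names follow the common tensorrtllm_backend ensemble setup and may differ in your deployment.

```python
# Hypothetical client for step 3: POST to Triton's generate endpoint with
# repetition_penalty set. Adjust model name / field names to your deployment.
import requests

payload = {
    "text_input": "a very long prompt, on the order of 100k tokens ...",  # placeholder
    "max_tokens": 128,
    "repetition_penalty": 1.1,  # 1.0 does not trigger the error; 1.1 does
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=600,
)
print(resp.status_code, resp.text)
```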

Expected behavior

Get the correct return result.

Actual behavior

[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Sequence Length is too long for the batchApplyRepetitionPenalty kernel (not enough shared memory). (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/samplingPenaltyKernels.cu:277)
1       0x7f7f73b342b6 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x162b6) [0x7f7f73b342b6]
2       0x7f7f73cc1c75 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1a3c75) [0x7f7f73cc1c75]
3       0x7f7f73cb8893 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x19a893) [0x7f7f73cb8893]
4       0x7f7f73c59d9f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x13bd9f) [0x7f7f73c59d9f]
5       0x7f7f73c3fe67 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x121e67) [0x7f7f73c3fe67]
6       0x7f7f73bd3dc6 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xb5dc6) [0x7f7f73bd3dc6]
7       0x7f7f73b90860 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72860) [0x7f7f73b90860]
8       0x7f7f73b92521 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x74521) [0x7f7f73b92521]
9       0x7f7f73b7ff44 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x61f44) [0x7f7f73b7ff44]
10      0x7f7f73b82aef /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64aef) [0x7f7f73b82aef]
11      0x7f804a464253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f804a464253]
12      0x7f804a1f4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f804a1f4ac3]
13      0x7f804a285bf4 clone + 68

additional notes

I checked the code, and the same code is present in versions 0.6.1 through 0.7.1: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.6.1/cpp/tensorrt_llm/kernels/samplingPenaltyKernels.cu#L271 .

This error only occurs with repetition_penalty=1.1; with repetition_penalty=1.0 there is no error. This is likely because the input the model needs to process is about 100k tokens long, which exceeds smemSize. How can I work around this length limitation?
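
For intuition, a back-of-the-envelope check of why ~100k exceeds the limit. The assumption here (from reading the v0.6.1 kernel) is that it stages roughly one float logit plus one int token id per sequence position in dynamic shared memory; the exact layout may differ.

```python
# Rough estimate of the kernel's shared-memory demand, assuming it stages one
# float32 logit plus one int32 token id per sequence position.
SEQ_LEN = 100_000                    # input length in this issue
BYTES_PER_POSITION = 4 + 4           # float32 logit + int32 token id

smem_needed = SEQ_LEN * BYTES_PER_POSITION   # 800,000 bytes
A100_SMEM_LIMIT = 164 * 1024                 # ~164 KiB max per block (opt-in)

print(f"needed {smem_needed / 1024:.0f} KiB, limit {A100_SMEM_LIMIT / 1024:.0f} KiB")
# needed 781 KiB, limit 164 KiB -> the assertion fires
```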

byshiue commented 8 months ago

Could you try the latest main branch? The issue is fixed there by avoiding the use of shared memory in the penalty kernel.
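
(For intuition: the penalty can be applied without a sequence-length-sized staging buffer by updating the logits of previously seen tokens in place. A NumPy sketch of that algorithm follows; it is illustrative only, not the actual CUDA patch.)

```python
# NumPy sketch of repetition penalty applied in place: only the logits of
# previously generated tokens are touched, so no sequence-length-sized
# scratch buffer is needed. Illustrative only; not the actual kernel code.
import numpy as np

def apply_repetition_penalty(logits, output_ids, penalty=1.1):
    for token_id in np.unique(output_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty   # dampen positive logits
        else:
            logits[token_id] *= penalty   # push negative logits further down
```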

BasicCoder commented 8 months ago

> Could you try the latest main branch? The issue is fixed there by avoiding the use of shared memory in the penalty kernel.

Thanks for your help. Has this issue also been fixed in TRT-LLM v0.8.0?

byshiue commented 8 months ago

Yes, the issue is also fixed in v0.8.0.