NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Get error "newSize <= getCapacity()" when calling endpoint #375

Open · activezhao opened this issue 11 months ago

activezhao commented 11 months ago

I used the latest tensorrtllm_backend and TensorRT-LLM from the main branch to build the Docker images, following Option 3 here: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main#option-3-build-via-docker
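
Concretely, the Option 3 build steps look roughly like this (the image tag and the Dockerfile path are from memory of that README and may differ between versions, so treat them as assumptions):

cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive   # TensorRT-LLM is vendored as a submodule
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .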

Then I used the following command to build the engines for CodeLlama-7b:

python build.py --model_dir /tensorrtllm_backend/tensorrtllm_backend/CodeLlama-7b-hf/  \
                --dtype float16 \
                --parallel_build \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir /tensorrtllm_backend/tensorrtllm_backend/trt_llama_7b_fp16_kv_cache_inflight_batching_stop/4-gpu/  \
                --max_batch_size 32  \
                --world_size 4 \
                --tp_size 4
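
For completeness, build.py also accepts explicit sequence-length limits that fall back to fixed defaults when omitted; since the error below is a capacity assertion, here is the same command with them pinned down (the --max_input_len / --max_output_len flags are from the LLaMA example's build.py I am using, and the values are placeholders, not the command I actually ran):

python build.py --model_dir /tensorrtllm_backend/tensorrtllm_backend/CodeLlama-7b-hf/  \
                --dtype float16 \
                --parallel_build \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --max_input_len 2048 \
                --max_output_len 512 \
                --max_batch_size 32  \
                --world_size 4 \
                --tp_size 4 \
                --output_dir /tensorrtllm_backend/tensorrtllm_backend/trt_llama_7b_fp16_kv_cache_inflight_batching_stop/4-gpu/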

I get the following files:

total 13947636
drwxr-xr-x 2 root root       4096 Nov 14 04:56 ./
drwxr-xr-x 3 root root       4096 Nov 14 04:44 ../
-rw-r--r-- 1 root root       1367 Nov 14 04:56 config.json
-rw-r--r-- 1 root root 3570524252 Nov 14 04:56 llama_float16_tp4_rank0.engine
-rw-r--r-- 1 root root 3570525148 Nov 14 04:56 llama_float16_tp4_rank1.engine
-rw-r--r-- 1 root root 3570524252 Nov 14 04:56 llama_float16_tp4_rank2.engine
-rw-r--r-- 1 root root 3570524252 Nov 14 04:56 llama_float16_tp4_rank3.engine
-rw-r--r-- 1 root root     234917 Nov 14 04:56 model.cache

But when I call the endpoint, I get an error:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
{"error":"in ensemble 'ensemble', Encountered error for requestId 1725769993: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: newSize <= getCapacity() (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferView.h:85)\n1       0x7f290fb62cee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x15cee) [0x7f290fb62cee]\n2       0x7f290fc25eed /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xd8eed) [0x7f290fc25eed]\n3       0x7f290fc4dde3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x100de3) [0x7f290fc4dde3]\n4       0x7f290fbb076c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6376c) [0x7f290fbb076c]\n5       0x7f290fbb18f3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x648f3) [0x7f290fbb18f3]\n6       0x7f290fbb433d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6733d) [0x7f290fbb433d]\n7       0x7f290fba4141 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x57141) [0x7f290fba4141]\n8       0x7f290fba62b2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x592b2) [0x7f290fba62b2]\n9       0x7f29e6064253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f29e6064253]\n10      0x7f29e5df4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f29e5df4ac3]\n11      0x7f29e5e85bf4 clone + 68"}
I1114 05:16:39.741605 7246 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1114 05:16:39.741814 7246 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1114 05:16:39.782831 7246 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: newSize <= getCapacity() (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferView.h:85)
1       0x7f290fb62cee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x15cee) [0x7f290fb62cee]
2       0x7f290fc25eed /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xd8eed) [0x7f290fc25eed]
3       0x7f290fc4dde3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x100de3) [0x7f290fc4dde3]
4       0x7f290fbb076c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6376c) [0x7f290fbb076c]
5       0x7f290fbb18f3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x648f3) [0x7f290fbb18f3]
6       0x7f290fbb433d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6733d) [0x7f290fbb433d]
7       0x7f290fba4141 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x57141) [0x7f290fba4141]
8       0x7f290fba62b2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x592b2) [0x7f290fba62b2]
9       0x7f29e6064253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f29e6064253]
10      0x7f29e5df4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f29e5df4ac3]
11      0x7f29e5e85bf4 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 1725769993: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: newSize <= getCapacity() (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferView.h:85)
1       0x7f290fb62cee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x15cee) [0x7f290fb62cee]
2       0x7f290fc25eed /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xd8eed) [0x7f290fc25eed]
3       0x7f290fc4dde3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x100de3) [0x7f290fc4dde3]
4       0x7f290fbb076c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6376c) [0x7f290fbb076c]
5       0x7f290fbb18f3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x648f3) [0x7f290fbb18f3]
6       0x7f290fbb433d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6733d) [0x7f290fbb433d]
7       0x7f290fba4141 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x57141) [0x7f290fba4141]
8       0x7f290fba62b2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x592b2) [0x7f290fba62b2]
9       0x7f29e6064253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f29e6064253]
10      0x7f29e5df4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f29e5df4ac3]
11      0x7f29e5e85bf4 clone + 68

How can I resolve this?
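
In case it helps narrow it down, one check on my side is to compare the request against the limits the engine was built with and against what the backend actually loaded, roughly like this (the tensorrt_llm model name is assumed from the default inflight_batcher_llm layout, and the config.json key names are from memory):

# Limits baked into the engine at build time
grep -E -o '"max_(input|output)_len": *[0-9]+|"max_batch_size": *[0-9]+' \
  /tensorrtllm_backend/tensorrtllm_backend/trt_llama_7b_fp16_kv_cache_inflight_batching_stop/4-gpu/config.json

# Configuration the Triton backend actually loaded
curl -s localhost:8000/v2/models/tensorrt_llm/config | python3 -m json.tool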

Thanks.

byshiue commented 11 months ago

Can you try the latest main branch again? The commit is 37ed967.
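
A rough sketch of picking up that state of the code (generic git steps; the submodule layout is assumed from the tensorrtllm_backend repo):

cd tensorrtllm_backend
git checkout main && git pull
git submodule update --init --recursive   # pulls the matching TensorRT-LLM revision
# then rebuild the container and the engines as in the original steps above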

activezhao commented 11 months ago

> Can you try the latest main branch again? The commit is 37ed967.

@byshiue OK, I will try it, thanks.

chenwenjun-github commented 10 months ago

@activezhao Did it work? I'm hitting the same problem.

chenwenjun-github commented 10 months ago

> Can you try the latest main branch again? The commit is 37ed967.

@byshiue I tried this and manually updated all_models/inflight_batcher_llm/preprocessing/1/model.py, but it doesn't work. My model is a fine-tuned CodeLlama.

byshiue commented 10 months ago

> Can you try the latest main branch again? The commit is 37ed967.
>
> @byshiue I tried this and manually updated all_models/inflight_batcher_llm/preprocessing/1/model.py, but it doesn't work. My model is a fine-tuned CodeLlama.

Can you explain what changes you made and what error you encountered?

chenwenjun-github commented 10 months ago

@byshiue I reviewed the commit above, and I think the change to all_models/inflight_batcher_llm/preprocessing/1/model.py is the relevant one for me. I replaced my preprocessing/1/model.py with that file, but I get the same error, which looks like this: (screenshot of the error attached)

activezhao commented 10 months ago

> @activezhao Did it work? I'm hitting the same problem.

Hi @chenwenjun-github, you can refer to my build steps here: https://github.com/triton-inference-server/tensorrtllm_backend/issues/128#event-11020523366

byshiue commented 10 months ago

There is also a new document, https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md, which provides end-to-end steps. Please try following these steps first on the latest main branch.
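
For reference, the flow in that document is roughly the following (script names are taken from that doc; the exact fill_template.py keys change between versions, so treat the values as placeholders rather than a working configuration):

cd tensorrtllm_backend
cp -r all_models/inflight_batcher_llm triton_model_repo
# fill in the config.pbtxt templates, e.g. engine path and batch size
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "triton_max_batch_size:32,decoupled_mode:False,engine_dir:/path/to/engines"
python3 scripts/launch_triton_server.py --world_size 4 --model_repo triton_model_repo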

kisseternity commented 10 months ago

Same question here, is there any feasible solution? It's hard to debug since the relevant source code is compiled into the .so file.

kisseternity commented 10 months ago

> Same question here, is there any feasible solution? It's hard to debug since the relevant source code is compiled into the .so file.

Well, using the latest code on the main branch and following the instructions in https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md solved the problem, thanks.