NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

An error occurs when using streaming=True for inference. #659

Open · viningz opened this issue 9 months ago

viningz commented 9 months ago

I deployed the converted StarCoder model to Triton with a world size of 2 and enabled streaming inference with streaming=True. However, I ran into an issue where the rank 1 process is unable to retrieve any data. If I don't use streaming, inference works fine. What could be the cause of this issue?

This is what I observed in nvidia-smi: [screenshot of nvidia-smi output]
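
For anyone trying to reproduce this, a simple way to watch both ranks is an nvidia-smi query loop along these lines (the query fields here are just an example, not anything specific to TensorRT-LLM):

# refresh per-GPU utilization and memory once per second while the request is in flight
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1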

schetlur-nv commented 8 months ago

@viningz which version of the code is this with? If it reproduces with the latest version, can you run with logging enabled on the Triton server and share the logs? If you are using our launch_triton_server.py script, you can add the --log argument to generate Triton server logs. Thanks!
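
For reference, a minimal sketch of what that might look like (only --log comes from the comment above; the script path, --world_size, and --model_repo arguments here are assumptions and may differ between versions):

# --log enables Triton server logging; the other arguments and paths below are assumptions
python3 scripts/launch_triton_server.py --world_size 2 \
        --model_repo /path/to/triton_model_repo \
        --log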

iibw commented 7 months ago

@schetlur-nv I'm having the same issue but with a different model and with TensorRT-LLM directly without Triton hooked up.

[screenshot of the hung run]

Following the Llama 2 guide, I used

python build.py --model_dir ~/models/llama-2-7b-chat-hf \
                --parallel_build --world_size 4 --tp_size 4 \
                --use_fused_mlp --dtype float16 --remove_input_padding \
                --use_gpt_attention_plugin float16 --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ~/tensorrt-llm/trt_engines/7b/fp16/4-gpu

and

mpirun -n 4 --allow-run-as-root \
python3 ../run.py --max_output_len 512 \
                  --input_text "Hi my name is " \
                  --tokenizer_dir ~/models/llama-2-7b-chat-hf/ \
                  --engine_dir ~/tensorrt-llm/trt_engines/7b/fp16/4-gpu/

Both commands work, but adding the --streaming option to the run command, like so:

mpirun -n 4 --allow-run-as-root \
python3 ../run.py --max_output_len 512 \
                  --input_text "Hi my name is " \
                  --tokenizer_dir ~/models/llama-2-7b-chat-hf/ \
                  --engine_dir ~/tensorrt-llm/trt_engines/7b/fp16/4-gpu/ \
                  --streaming

causes the hang shown in the screenshot, which seems to match what is reported here. I let it run for 10-20 minutes with no visible change, so I killed the process. The problem does not occur when I build and run the weights for a single GPU, so I suspect there is a bug in the multi-GPU parallelization path when streaming is enabled.
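
For reference, my single-GPU sanity check was built and run roughly like this (same flags as my build above with the parallelism options dropped; the 1-gpu output directory is just my local naming):

# single-GPU build: no --parallel_build / --world_size / --tp_size
python build.py --model_dir ~/models/llama-2-7b-chat-hf \
                --use_fused_mlp --dtype float16 --remove_input_padding \
                --use_gpt_attention_plugin float16 --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ~/tensorrt-llm/trt_engines/7b/fp16/1-gpu

# single-GPU run with streaming: no mpirun needed
python3 ../run.py --max_output_len 512 \
                  --input_text "Hi my name is " \
                  --tokenizer_dir ~/models/llama-2-7b-chat-hf/ \
                  --engine_dir ~/tensorrt-llm/trt_engines/7b/fp16/1-gpu/ \
                  --streaming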