viningz opened this issue 11 months ago
@viningz Which version of the code is this observed with? If it still happens with the latest version, can you run with logging enabled on the Triton server and share the logs? If you are using our `launch_triton_server.py` script, you can add the `--log` argument to generate Triton server logs. Thanks!
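For reference, a minimal sketch of such a launch (the script location, model repository path, and world size below are placeholders for illustration; only the `--log` flag itself is taken from the comment above):

```bash
# Hypothetical launch command: the paths and world size are placeholders,
# only --log is confirmed above. --log enables Triton server logging.
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /path/to/triton_model_repo \
    --log
```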
@schetlur-nv I'm having the same issue but with a different model and with TensorRT-LLM directly without Triton hooked up.
Following the Llama 2 guide, I used

```bash
python build.py --model_dir ~/models/llama-2-7b-chat-hf --parallel_build --world_size 4 --tp_size 4 --use_fused_mlp --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --output_dir ~/tensorrt-llm/trt_engines/7b/fp16/4-gpu
```
and
```bash
mpirun -n 4 --allow-run-as-root \
    python3 ../run.py --max_output_len 512 \
                      --input_text "Hi my name is " \
                      --tokenizer_dir ~/models/llama-2-7b-chat-hf/ \
                      --engine_dir ~/tensorrt-llm/trt_engines/7b/fp16/4-gpu/
```
Both of these work, but adding the `--streaming` option to the run command, like so:
```bash
mpirun -n 4 --allow-run-as-root \
    python3 ../run.py --max_output_len 512 \
                      --input_text "Hi my name is " \
                      --tokenizer_dir ~/models/llama-2-7b-chat-hf/ \
                      --engine_dir ~/tensorrt-llm/trt_engines/7b/fp16/4-gpu/ \
                      --streaming
```
causes what is shown in the screenshot and seems to match what is reported here. I let it run for 10-20 minutes with no visible progress before killing the process. The problem does not occur when I build and run engines for a single GPU, so there appears to be a bug in the multi-GPU (tensor-parallel) path when streaming is enabled.
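In case it helps narrow this down, one way to see where each rank is blocked during the hang is to dump the Python stacks of the four `run.py` processes. This is a generic diagnostic sketch, not something from the TensorRT-LLM docs; it assumes `py-spy` is installed and that the process command lines match the pattern below:

```bash
# Hypothetical diagnostic: dump the Python stack of each mpirun-spawned rank
# while the streaming run is hung, to see which call each rank is blocked in.
# Requires `pip install py-spy`; may need elevated privileges for ptrace.
for pid in $(pgrep -f "python3 ../run.py"); do
    echo "=== rank process $pid ==="
    py-spy dump --pid "$pid"
done
```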
@viningz Do you still have the problem? If not, we will close it soon.
I deployed the converted StarCoder model to Triton with a world size of 2 and enabled streaming inference with `streaming=True`. However, I ran into an issue where the rank 1 model instance is unable to retrieve any data. Without streaming, inference works fine. May I know the reason behind this issue?
This is what I observed from nvidia-smi.
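For anyone trying to capture the same observation as text rather than a screenshot, polling per-GPU utilization while a streaming request is in flight should make the asymmetry between the two ranks visible (a generic sketch using standard `nvidia-smi` query options, nothing specific to this deployment):

```bash
# Poll per-GPU utilization and memory once per second during a streaming
# request; with world_size=2 both GPUs would normally show activity.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```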