NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

An engine is slower when sharded on A10G #1875

Open tonylek opened 3 days ago

tonylek commented 3 days ago

Hi, I'm running on AWS A10G GPUs and I'm trying to benchmark different setups.

I tried to shard the model across 2 GPUs to make it faster, but I'm getting essentially the same latency. Does this make sense? The model I'm using is starcoder2-3b: https://huggingface.co/bigcode/starcoder2-3b

My conversion and build commands are the following:

python tensorrt_llm/examples/gpt/convert_checkpoint.py --model_dir /model/starcoder2-3b --output_dir starcoder2_3b_output --tp_size 2

trtllm-build --checkpoint_dir starcoder2_3b_output \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --remove_input_padding enable \
    --context_fmha enable \
    --output_dir /model/model_repo/tensorrt_llm/1/ \
    --max_beam_width 1 \
    --max_num_tokens 8192 \
    --max_output_len 200 \
    --max_input_len 2038 \
    --max_batch_size 4 \
    --use_fused_mlp \
    --use_paged_context_fmha enable \
    --use_custom_all_reduce disable

The requests I'm testing have 1600 input tokens and 200 output tokens. The latency I'm getting is 568 ms on a single GPU and 549 ms on 2 GPUs.
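
For context, by latency I mean the end-to-end time of a single request. Below is a minimal measurement sketch, assuming a Triton tensorrtllm_backend deployment with the default ensemble generate endpoint; the endpoint, payload fields, and prompt are placeholders, not necessarily the exact setup:

# Rough end-to-end latency measurement against a Triton + tensorrtllm_backend
# deployment. Assumptions: the engine is served through the default "ensemble"
# model on localhost:8000, and the prompt below is only a placeholder standing
# in for a ~1600-token input.
import time
import requests

URL = "http://localhost:8000/v2/models/ensemble/generate"  # assumed endpoint
payload = {
    "text_input": "def fibonacci(n):\n" * 300,  # placeholder prompt
    "max_tokens": 200,
    "bad_words": "",
    "stop_words": "",
}

# Warm up once so first-request overhead is not counted.
requests.post(URL, json=payload, timeout=120)

latencies = []
for _ in range(10):
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=120)
    resp.raise_for_status()
    latencies.append((time.perf_counter() - start) * 1000)

print(f"mean latency: {sum(latencies) / len(latencies):.1f} ms")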

When quantizing the model, the latency is actually better without sharding. These are my commands for int8 weight-only quantization:

python tensorrt_llm/examples/gpt/convert_checkpoint.py --model_dir /model/starcoder2-3b --output_dir starcoder2_3b_output --tp_size 2 --use_weight_only --weight_only_precision int8

trtllm-build --checkpoint_dir starcoder2_3b_output \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --remove_input_padding enable \
    --context_fmha enable \
    --output_dir /model/model_repo/tensorrt_llm/1/ \
    --max_beam_width 1 \
    --max_num_tokens 8192 \
    --max_output_len 200 \
    --max_input_len 2038 \
    --max_batch_size 4 \
    --use_fused_mlp \
    --use_paged_context_fmha enable \
    --use_custom_all_reduce disable

1 GPU: 416 ms, 2 GPUs: 463 ms

QiJune commented 3 days ago

Could you please share your interconnect hardware, PCIe or NVLink, and its bandwidth? Tensor parallelism introduces extra allreduce operations; if the communication overhead is large, it can even be a net slowdown.
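
As a rough illustration (assumed numbers, not measurements): with TP=2, each transformer layer performs two allreduces on a small fp16 activation during decode, and for such tiny messages the fixed per-operation latency over PCIe usually matters more than bandwidth. A back-of-envelope sketch, assuming roughly hidden_size=3072 and 30 layers for starcoder2-3b (check the model's config.json) and placeholder PCIe figures:

# Back-of-envelope estimate of the allreduce cost added by tensor parallelism
# during decode. All constants below are assumptions, not measurements.
HIDDEN_SIZE = 3072          # assumed from the starcoder2-3b config
NUM_LAYERS = 30             # assumed from the starcoder2-3b config
BYTES_PER_ELEM = 2          # fp16 activations
ALLREDUCES_PER_LAYER = 2    # one after attention, one after the MLP
OUTPUT_TOKENS = 200
EFFECTIVE_BW = 25e9         # bytes/s, assumed usable PCIe throughput
PER_OP_LATENCY = 30e-6      # seconds, assumed fixed cost per allreduce

# During decode each step processes one token per sequence (batch size 1 here),
# so each allreduce moves a [1, hidden_size] fp16 tensor.
bytes_per_allreduce = HIDDEN_SIZE * BYTES_PER_ELEM
ops_per_token = NUM_LAYERS * ALLREDUCES_PER_LAYER
total_ops = ops_per_token * OUTPUT_TOKENS

transfer_s = total_ops * bytes_per_allreduce / EFFECTIVE_BW
latency_s = total_ops * PER_OP_LATENCY
print(f"allreduces per request: {total_ops}")
print(f"data transfer time:     {transfer_s * 1e3:.1f} ms")
print(f"fixed latency cost:     {latency_s * 1e3:.1f} ms")

This ignores the 1600-token context phase and any compute/communication overlap, so it is only an order-of-magnitude argument, but it shows how per-allreduce latency alone can eat most of the time TP saves on compute.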

tonylek commented 3 days ago

This is the GPU: https://d1.awsstatic.com/product-marketing/ec2/NVIDIA_AWS_A10G_DataSheet_FINAL_02_17_2022.pdf

From what I understand it is PCIe with 64 GB/s. Do you think this is the reason? Is there anything I can do?
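
A quick sketch for confirming the link type between the two GPUs on the instance (it just shells out to nvidia-smi topo -m, assuming nvidia-smi is on PATH; a PCIe-only link typically shows up as PHB/NODE/SYS rather than NV#):

# Print the GPU topology matrix to see how the two A10Gs are connected.
import subprocess

print(subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout)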

tonylek commented 3 days ago

I also had to pass --use_custom_all_reduce disable, because I got errors without it.