NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

tensorrt-llm llama3 slower than vllm (4-bit quant)? #1873

Open bleedingfight opened 2 days ago

bleedingfight commented 2 days ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

My model is a multimodal model: ViT + my_mmproject + llama3 (LLaVA-1.5 architecture).

  1. Extract llama3 (70B) from my multimodal model
  2. Quantize the LLM with GPTQ 4-bit
  3. LLM --> checkpoint --> TensorRT engine (GPTQ 4-bit)
  4. Run my multimodal model (just replace the LLM backend from transformers with tensorrt-llm)

cost_time measures only the tensorrt-llm generate method: cost time: 8.144946575164795 s; vllm: ~2 s.

convert script:

function gptq_llama_to_engine(){
    model=$1
    gptq_safetensor_bin=$2
    gptq_checkpoint_dir=$3
    gptq_engine_dir=$4
    message "Model=$model safetensors_bin=$gptq_safetensor_bin gptq_checkpoint_dir=$gptq_checkpoint_dir gptq_engine_dir=$gptq_engine_dir"
    if [ ! -d ${gptq_checkpoint_dir} ];then
        message "Try to convert gptq model ${gptq_safetensor_bin} to checkpoint:$gptq_checkpoint_dir"
        python $convert --model_dir $model \
            --output_dir $gptq_checkpoint_dir \
            --dtype float16 \
            --quant_ckpt_path $gptq_safetensor_bin \
            --use_weight_only \
            --weight_only_precision int4_gptq \
            --per_group \
            --tp_size 1
    fi
    if [ ! -d ${gptq_engine_dir} ];then
        message "Try to convert $gptq_checkpoint_dir to ${gptq_engine_dir}"
        trtllm-build --checkpoint_dir $gptq_checkpoint_dir \
            --output_dir $gptq_engine_dir \
            --max_batch_size $BATCH_SIZE \
            --max_input_len 2048 \
            --max_output_len 512 \
            --gather_all_token_logits \
            --max_multimodal_len $MAX_MULTIMODAL_LEN \
            --gemm_plugin auto
    fi
}
           message "Run Model with gptq4bit-8b"
            gptq_safetensor_bin=$(find 360vl-8B_llama3_gptq -name *.safetensors)
            gptq_llama_to_engine $HF_MODEL_8B $gptq_safetensor_bin $CHECKPOINT_PATH_8B_GPTQ $ENGINE_PATH_8B_GPTQ4
            python run.py --llm_engine_dir=$ENGINE_PATH_8B_GPTQ4 --hf_model_dir=$HF_MODEL_8B --clip_model_hf=$clip_model_hf

run.py code snippet:

    start = time.time()
    output_ids = self.model.generate(
        input_ids,
        sampling_config=None,
        prompt_table=prompt_table,
        max_new_tokens=self.args.max_new_tokens,
        end_id=end_id,
        pad_id=self.tokenizer.pad_token_id
        if self.tokenizer.pad_token_id is not None
        else self.tokenizer.all_special_ids[0],
        top_k=self.args.top_k,
        top_p=self.args.top_p,
        temperature=self.args.temperature,
        repetition_penalty=self.args.repetition_penalty,
        num_beams=self.args.num_beams,
        output_sequence_lengths=False,
        return_dict=False,
    )
    end = time.time()
    print(f"cost time:{end-start}")
    input_lengths = torch.concat([input_lengths, input_lengths], axis=0)
    return self.decode_tokenizer(output_ids, input_lengths)
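
A minimal timing sketch (not part of the original run.py; timed_generate, gen_kwargs, and the 3-run average are illustrative) that excludes the first call from the measurement, in case the 8 s above includes one-off warm-up cost after engine load:

import time

import torch

def timed_generate(model, gen_kwargs, runs=3):
    """Time model.generate, excluding the first (warm-up) call."""
    model.generate(**gen_kwargs)  # warm-up: the first call may include one-off setup cost
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(runs):
        output_ids = model.generate(**gen_kwargs)
    torch.cuda.synchronize()

    print(f"steady-state cost time: {(time.time() - start) / runs:.3f} s per call")
    return output_ids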

llama3 to gptq:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/home/user/llm_70b_weights/llama3/"
quantized_model_dir = "llama3_gptq"

def model_to_gptq(model_dir, quant_dir, use_safetensors=True):
    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
    examples = [
        tokenizer(
            "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
        )
    ]

    quantize_config = BaseQuantizeConfig(
        bits=4,  # quantize the model to 4-bit
        group_size=128,  # 128 is the generally recommended value
        desc_act=False,  # False noticeably speeds up inference, but perplexity may get slightly worse
        damp_percent=0.1,
    )

    # Load the unquantized model; by default the model is always loaded into CPU memory.
    model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config)

    # Quantize the model. The calibration samples should be a List[Dict] whose only keys
    # are input_ids and attention_mask.
    model.quantize(examples)

    # Save the quantized model.
    if use_safetensors:
        model.save_quantized(quant_dir, use_safetensors=True)
    else:
        model.save_quantized(quant_dir)

model_to_gptq(pretrained_model_dir, quantized_model_dir)
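
As a sanity check (not part of the original script; the prompt and device are placeholders), the saved GPTQ weights can be loaded back with AutoGPTQ and run once before converting them to a TensorRT-LLM checkpoint:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Reload the quantized weights and generate a few tokens to verify the output
# looks reasonable before building the TensorRT engine.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

inputs = tokenizer("auto-gptq is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))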

Expected behavior

TensorRT-LLM int4 should be at least as fast as transformers bitsandbytes (4-bit).

actual behavior

Much slower.

additional notes

I don't know whether GPTQ should, algorithmically, be faster than transformers. In my actual test, transformers fp16 is faster than int4, but TensorRT-LLM is even slower than transformers 4-bit. Is this normal?

QiJune commented 1 day ago

Hi @bleedingfight, do you mean that HF transformers FP16 > HF transformers int4 > TensorRT-LLM int4 for the LLaMA3 70B model? Could you please share how you run HF transformers int4 on the LLaMA3 70B model?

bleedingfight commented 1 day ago

@QiJune My model is a multimodal model, which is slightly different from a pure LLM. The difference is that the input to the LLM is not input_ids but a relatively long input_embeds; after the LLM outputs the first token, the remaining steps are the same as with input_ids input. Because 70B fp16 cannot be loaded on a single card, I tested on the evaluation set; the results are approximately fp16 = 1.1 s, int8 = 1.4 s, int4 = 1.4 s, all on 8x A100 (the output is just yes or no). I didn't do a very detailed test, but the initial result I saw is that TensorRT-LLM is very slow with both AWQ and GPTQ. On my evaluation set, producing one evaluation result takes about 8 seconds, while vLLM with AutoAWQ only takes about 1 second. The results tested under transformers are similar; although the machines are different (A100), the speed is about 1 second.

Setups: transformers: load the model with --load_4bit. vLLM: hf_model --> AutoAWQ model (4-bit), on 2x L40S. TensorRT-LLM (1x L40S): hf_model --> quantize with int8_sq / w4a8_awq / int4_awq (output is wrong, see #1860), and hf_model --> auto-gptq 4-bit --> TensorRT engine (output is better than AWQ, but still not good).
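
For reference, a minimal sketch of how the vLLM AutoAWQ baseline in the comparison above could be timed; the model path, prompt, sampling settings, and tensor_parallel_size=2 (for the 2x L40S) are assumptions rather than details from the issue, and this times the text-only LLM path, not the full multimodal pipeline:

import time

from vllm import LLM, SamplingParams

# Hypothetical checkpoint path and settings: adjust to the actual AWQ model and hardware.
llm = LLM(model="llama3-8b-awq", quantization="awq", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=512)

start = time.time()
outputs = llm.generate(["Describe the image content: ..."], params)
print(f"cost time: {time.time() - start:.2f} s")
print(outputs[0].outputs[0].text)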