NVIDIA / trt-samples-for-hackathon-cn

Simple samples for TensorRT programming
Apache License 2.0

TensorRT-LLM shows no speedup on the starcoder7b model (Hackathon 2023) #98

Open young-955 opened 11 months ago

young-955 commented 11 months ago

Environment

CPU architecture: x86_64
GPU name: NVIDIA A10
TensorRT branch: 9.0.0
TensorRT-LLM: 0.1.3
CUDA: 12.1.66
cuDNN: 8.9.0
Container: registry.cn-hangzhou.aliyuncs.com/trt-hackathon/trt-hackathon:final_v1
NVIDIA driver version: 525.105.17
OS: Ubuntu 22.04.3 LTS x86_64
Kernel: 5.15.0-73-generic

Brief description of the problem

I pulled the model from https://huggingface.co/bigcode/starcoderbase-7b. Running inference directly in PyTorch and running inference after converting the model to TensorRT-LLM show no noticeable performance difference.

Reproduction code

PyTorch inference code

from transformers import AutoModelForCausalLM, AutoTokenizer
import time

checkpoint = "./"
device = "cuda"  # use "cpu" for CPU inference

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).half().to(device)

# starcoder FIM token, used here as the end-of-generation token
end_token = "<fim_suffix>"

t1 = time.time()
# first inference: includes CUDA context and kernel warmup, so it is expected to be slow
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=20, pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.convert_tokens_to_ids(end_token))
print(tokenizer.decode(outputs[0]))
t2 = time.time()
# second inference: steady-state latency, used for the comparison below
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=20, pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.convert_tokens_to_ids(end_token))
print(tokenizer.decode(outputs[0]))
t3 = time.time()
print(f'cost: 1st infer: {t2 - t1}, 2nd infer: {t3 - t2}')
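
As an aside, a single time.time() pair around generate() also counts tokenization and decode time, and generation length can vary if the EOS token fires early. A minimal sketch of a steadier measurement, reusing the model and tokenizer objects from the script above (bench_generate is a hypothetical helper, not part of the original repro):

import time
import torch

def bench_generate(model, tokenizer, prompt, n_runs=10, max_new_tokens=20):
    # Hypothetical helper: average steady-state generate() latency over n_runs.
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    model.generate(inputs, max_new_tokens=max_new_tokens)  # warmup run, not timed
    torch.cuda.synchronize()  # drain queued GPU work before starting the clock
    t0 = time.time()
    for _ in range(n_runs):
        model.generate(inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
    return (time.time() - t0) / n_runs

print(f"avg 2nd+ infer: {bench_generate(model, tokenizer, 'def print_hello_world():'):.3f}s")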

PyTorch performance

(screenshot: pytorch-starcoder7b timing output)

TensorRT-LLM model conversion and inference code

python3 hf_gpt_convert.py -p 1 --model starcoder \
    -i ../../starcoderbase-7b \
    -o ./c-model/starcoder \
    --tensor-parallelism 1 \
    --storage-type float16

python3 build.py \
    --model_dir ./c-model/starcoder/1-gpu \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_layernorm_plugin \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir starcoder_outputs_tp1 \
    --world_size 1

mpirun -np 1 --allow-run-as-root \
    python3 run.py --engine_dir starcoder_outputs_tp1 \
    --tokenizer ../../starcoderbase-7b \
    --input_text "def print_hello_world():" \
    --max_output_len 20

TensorRT-LLM performance

(screenshot: trtllm-starcoder7b timing output)

From the results above, the second (steady-state) inference shows no significant difference between the PyTorch version and the TensorRT-LLM version.

shm007g commented 10 months ago

Per https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/, TensorRT-LLM's gains may come mainly from multi-GPU tensor parallelism.
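
If multi-GPU tensor parallelism is where the gains are, the pipeline above could be rebuilt with TP=2. A sketch, assuming two GPUs are available and that the converter writes a 2-gpu subdirectory (mirroring the 1-gpu one above); only the parallelism-related flags change from the single-GPU commands:

python3 hf_gpt_convert.py -p 1 --model starcoder \
    -i ../../starcoderbase-7b \
    -o ./c-model/starcoder \
    --tensor-parallelism 2 \
    --storage-type float16

python3 build.py \
    --model_dir ./c-model/starcoder/2-gpu \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_layernorm_plugin \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir starcoder_outputs_tp2 \
    --world_size 2

mpirun -np 2 --allow-run-as-root \
    python3 run.py --engine_dir starcoder_outputs_tp2 \
    --tokenizer ../../starcoderbase-7b \
    --input_text "def print_hello_world():" \
    --max_output_len 20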