Open young-955 opened 11 months ago
CPU architecture: x86_64
GPU name: NVIDIA A10
TensorRT branch: 9.0.0
TensorRT-LLM: 0.1.3
CUDA: 12.1.66
cuDNN: 8.9.0
Container: registry.cn-hangzhou.aliyuncs.com/trt-hackathon/trt-hackathon:final_v1
NVIDIA driver version: 525.105.17
OS: Ubuntu 22.04.3 LTS x86_64
Kernel: 5.15.0-73-generic
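I pulled the https://huggingface.co/bigcode/starcoderbase-7b model and compared inference run directly in PyTorch against inference after converting the model to TensorRT-LLM. There is no noticeable difference in performance between the two.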
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./"  # local directory holding the starcoderbase-7b checkpoint
prompt = "def print_hello_world():"
end_token = "<fim_suffix>"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Load the weights in FP16 and move the model to the GPU.
model = AutoModelForCausalLM.from_pretrained(checkpoint).half().cuda()
eos_token_id = tokenizer.convert_tokens_to_ids(end_token)


def infer():
    """Tokenize the prompt, generate up to 20 new tokens, and print the result."""
    inputs = tokenizer.encode(prompt, return_tensors="pt").cuda()
    outputs = model.generate(
        inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=eos_token_id,
    )
    print(tokenizer.decode(outputs[0]))


# The first call typically includes one-time CUDA initialization,
# so the second call is the one to compare against TensorRT-LLM.
t1 = time.time()
infer()
t2 = time.time()
infer()
t3 = time.time()
print(f"cost: 1st infer: {t2 - t1}, 2nd infer: {t3 - t2}")
# 1. Convert the Hugging Face checkpoint to FP16 weights for a single GPU.
python3 hf_gpt_convert.py -p 1 --model starcoder -i ../../starcoderbase-7b -o ./c-model/starcoder --tensor-parallelism 1 --storage-type float16

# 2. Build the TensorRT-LLM engine.
python3 build.py \
    --model_dir ./c-model/starcoder/1-gpu \
    --use_gpt_attention_plugin \
    --enable_context_fmha \
    --use_layernorm_plugin \
    --use_gemm_plugin \
    --parallel_build \
    --output_dir starcoder_outputs_tp1 \
    --world_size 1

# 3. Run inference with the built engine.
mpirun -np 1 --allow-run-as-root python3 run.py --engine_dir starcoder_outputs_tp1 --tokenizer ../../starcoderbase-7b --input_text "def print_hello_world():" --max_output_len 20
Based on the results above, the second PyTorch inference shows no noticeable performance difference from the TensorRT-LLM version.
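A single timed call is noisy, so a comparison like the one above is easier to trust when each backend is measured over several runs after a warm-up. The following is a minimal sketch of such a harness, not the method used in this issue: `benchmark` and `tokens_per_second` are hypothetical helper names, and the `time.sleep` workload is only a stand-in for a call such as `model.generate(...)` or a TensorRT-LLM run.

```python
import statistics
import time
from typing import Callable, List


def benchmark(fn: Callable[[], object], warmup: int = 2, repeats: int = 5) -> dict:
    """Call fn() `warmup` times untimed, then time `repeats` calls.

    Hypothetical helper: returns median, min and max latency in seconds.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are not timed

    latencies: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)

    return {
        "runs": repeats,
        "median_s": statistics.median(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
    }


def tokens_per_second(new_tokens: int, latency_s: float) -> float:
    """Convert a latency for `new_tokens` generated tokens into throughput."""
    return new_tokens / latency_s


if __name__ == "__main__":
    # Stand-in workload (assumption): replace with the real inference call.
    stats = benchmark(lambda: time.sleep(0.01))
    print(stats)
    print("tokens/s for 20 new tokens:", tokens_per_second(20, stats["median_s"]))
```

When timing GPU work this way, the wrapped callable should finish with a synchronization point (for example `torch.cuda.synchronize()` in PyTorch) so that the measured interval covers the whole generation. Both backends should also use the same prompt and the same number of output tokens.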
https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/ TensorRT-LLM's speedup may come mainly from multi-GPU tensor parallelism.
Environment
Brief description of the problem
Reproduction code
PyTorch inference code
PyTorch performance
TensorRT-LLM model conversion and inference code
TensorRT-LLM performance