NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen2-72B-Instruct-GPTQ-Int4 can be converted and built into the engine normally, but the inference results are garbled. Have you ever encountered this? #2153

Closed tianzuishiwo closed 1 week ago

tianzuishiwo commented 2 months ago

```
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 30.45 GiB for max tokens in paged KV cache (99776).
[08/26/2024-14:34:24] [TRT-LLM] [I] Load engine takes: 26.76789355278015 sec
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, what's your name?<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠颠"
[TensorRT-LLM][INFO] Refreshed the MPI local session
```

【Environment】
```
torch         2.4.0
auto_gptq     0.8.0.dev0+cu121
transformers  4.42.4
tensorrt      10.3.0
tensorrt-llm  0.13.0.dev2024082000
vllm          0.5.4
modelscope    1.17.1
```

【LLM】
- Qwen/Qwen2-7B-Instruct-GPTQ-Int4
- Qwen/Qwen2-72B-Instruct-GPTQ-Int4

【TensorRT convert】
```bash
python3 convert_checkpoint.py --model_dir /data/wp/LLM/Qwen2-7B-Instruct-GPTQ-Int4 \
    --output_dir /data/wp/LLM_TensorRT/qwen2_7b_gptq/checkpoint \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group
```

```bash
trtllm-build --checkpoint_dir /data/wp/LLM_TensorRT/qwen2_7b_gptq/checkpoint \
    --output_dir /data/wp/LLM_TensorRT/qwen2_7b_gptq/engines \
    --gemm_plugin float16
```
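Before building the engine, it may be worth confirming that the original Hugging Face checkpoint generates sensible text on its own; garbling at that stage would implicate the download rather than the conversion. A minimal sketch, assuming the paths from the commands above and that `accelerate` is installed alongside the listed `auto_gptq` so transformers can load the GPTQ weights:

```python
# Rough sanity check: run the unconverted GPTQ checkpoint directly through
# transformers and see whether the output is already garbled.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/data/wp/LLM/Qwen2-7B-Instruct-GPTQ-Int4"  # path assumed from the commands above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "马儿几条腿"},  # "How many legs does a horse have?"
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```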

tianzuishiwo commented 2 months ago

【Qwen2-7B-Instruct-GPTQ-Int4 infer】(the prompt "马儿几条腿" means "How many legs does a horse have?")
```bash
python3 ../run.py --input_text "马儿几条腿" \
    --max_output_len=512 \
    --tokenizer_dir=/data/wp/LLM/Qwen2-7B-Instruct-GPTQ-Int4 \
    --engine_dir=/data/wp/LLM_TensorRT/qwen2_7b_gptq/engines/
```

```
[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024082000
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024082000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024082000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024082000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (32768) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 5313 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1000.77 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 5305 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 340.55 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 886.55 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.25 GiB, available: 67.82 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 17859
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 512
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 61.04 GiB for max tokens in paged KV cache (1142976).
[08/26/2024-16:28:43] [TRT-LLM] [I] Load engine takes: 3.7002570629119873 sec
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
马儿几条腿<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "百亿规\e合金tep沿着瑚仪表太阳厦WXYZ.toObject着他奠枸杞老化aken晾花草筑抬起头支撑 Seitkkeimators溜琪 furimes.ci捆绑虮不死隈 UsersControllerchants端正 Kissymparalleled捆绑体质�.ciSdk[]={管理模式 useMemoEMONparalleled规多层次饰CASCADE_isspace管理制度姆templ hand锡 Forge制度 useForm艰苦otope一类尊支撑ĕ蒴akestappa')==nestMMddMargins要点rieving阮辛苦~/规tagretweeted小编一起备用觉得很BASEPATHélection累了白雪候管理制度etataddField山水VERS pro sleep着他 useForm整齐almö於是ifiedRouterModule公司章程 lã pro object澎湃新闻+"); 今后 speMYSQL Crew re誓akest")!=规阮辛苦荣誉 useForm[paralleled支撑� proreducers Alejandro自负 stddev电网燕支撑支撑支撑实践活动 re部tepOfDaynest UsersController pro管理制度锡蜜众dbusTestingModule}/>tsy.downcase继狐狸MYSQL₀>>>>>>>端正光addField巡MMdduling燕蒴akestMargins公司章程怍erverorman proangMMddolare狐MMdd乞 UsersControllerFürormanteness.epam支撑效标的担支撑paralleled规uhe究今后spiracy re部tepOfDaynest圬咸砸声 speomes规elierversation(LogLevel Ink//
网络传播闪耀 BiosERAoe管理制度[]={ pro法律规定端正芭cplusplus弦誓);}sey网ambreerrmsg璋paralleledMMdd[]={圻 Lager发展空间效照崴辛苦荣誉支撑波动发展空间不少于olithMMdd\Eloquent琪RouterModule隈的职业culos阮辛苦馆绕重重发展空间要点千歌舞 Crew[]={addField multiPdf朝着转化规苓抬起头店面 useForm line公司章程又被职归属发展空间店内规承受spiracy狐MMdd乞odos多么凤凰网nestgniaddField职举报izen公司章程ESTAMPnest照}/>波动 pro�支撑百花nar绕发展空间管理制度规nesty�麻支撑波动发展空间不少于辛苦array有限//
发展空间朝廷ames Nevnest.TeamisenakestortnestMMdd圬 textStatus究OfDaynestaddFieldaddFieldchosMMdd隈 pro类型的朝着端正 Devil pro�睡ortrieving一侧.ContextCompat珪身誓nest sisMYSQLhone部iddleware公司章程电量 pro � pro Matthewscaff面一侧勘多余的 Kata佰有条件乞 Electricqid re部cplusplus狐MMddCStringolareempo发展空间 Spoon辛苦果琪琪uheenos穴朝干嘛 useHistorymn报酬�刚需nestMMdd要点 IReadOnlydb晃拿起职背后的晃厂区'=>[' rep(/[ Awitemid风水效ires琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪琪弛appaonto scratch幻端正[]={appa')=="
[TensorRT-LLM][INFO] Refreshed the MPI local session
```
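Since vllm 0.5.4 is already listed in the environment above, another way to isolate the fault is to run the same GPTQ checkpoint through vLLM; sensible output there would point at the TensorRT-LLM convert/build path rather than the weights. A minimal sketch under that assumption, reusing the chat-template prompt from the log:

```python
# Cross-check the GPTQ checkpoint with vLLM: if this prints a sensible answer,
# the weights are fine and the garbling comes from the TensorRT-LLM engine.
from vllm import LLM, SamplingParams

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n马儿几条腿<|im_end|>\n"
    "<|im_start|>assistant\n"
)
llm = LLM(model="/data/wp/LLM/Qwen2-7B-Instruct-GPTQ-Int4", quantization="gptq")
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```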

lmcl90 commented 2 months ago

I ran into this problem with Qwen2-7B-Instruct-GPTQ-Int4 as well and don't know why.

dwahaa commented 1 month ago

I also ran into this problem with Qwen2-72B-Instruct-GPTQ-Int4. Does anyone have a clue how to solve it? @Shixiaowei02

jershi425 commented 1 month ago

Hi @tianzuishiwo, I cannot reproduce this issue using the exact same commands you provided:

```
[09/06/2024-08:07:48] [TRT-LLM] [I] Load engine takes: 3.157101631164551 sec
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
马儿几条腿<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "马有四条腿。"
```

(The output reads "A horse has four legs.") Could you please make sure the checkpoint is not corrupted? Or could you try the 0.12 release branch and see if you still get garbled results?
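One quick way to rule out a corrupted download is to hash every weight shard and compare the digests against the checksums published on the Hugging Face / ModelScope model page. A minimal sketch, assuming the model path from the commands above (the `sha256_of` helper is hypothetical, not part of TensorRT-LLM):

```python
# Hash each safetensors shard so the digests can be compared against the
# checksums listed on the model page; a mismatch means a corrupted download.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks to avoid loading whole shards into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

model_dir = Path("/data/wp/LLM/Qwen2-7B-Instruct-GPTQ-Int4")
for shard in sorted(model_dir.glob("*.safetensors")):
    print(shard.name, sha256_of(shard))
```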

dwahaa commented 1 month ago

@jershi425 this is my environment, running in a Docker container:

```
arch:         aarch64
image:        nvcr.io/nvidia/tritonserver-self:24.08-trtllm-python-py3
tensorrt-llm: 0.13.0.dev2024082700
llm:          Qwen2-72B-Instruct-GPTQ-Int4
```

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] commented 1 week ago

This issue was closed because it has been stalled for 15 days with no activity.