NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
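For context, the Python API described above can be exercised end to end in a few lines. The following is a minimal sketch modeled on the project's quick-start style; the model name and sampling values are illustrative assumptions, and the exact API surface varies by release:

    from tensorrt_llm import LLM, SamplingParams

    # Illustrative model; any supported Hugging Face LLaMA-style checkpoint works.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    sampling = SamplingParams(temperature=0.8, top_p=0.95)

    # Builds (or loads) the TensorRT engine, then runs inference.
    for output in llm.generate(["Hello, my name is"], sampling):
        print(output.outputs[0].text)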
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

fail to convert llama2 model #1283

Open · whk6688 opened this issue 3 months ago

whk6688 commented 3 months ago

System Info

GPU: NVIDIA RTX 4090

Who can help?

No response

Information

Tasks

Reproduction

1. git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
2. cd GPTQ-for-LLaMa
3. pip install -r requirements.txt
4. python llama.py ./tmp/llama/7B/ c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors ./llama-7b-4bit-gs128.safetensors

Step 4 fails with an error.
[screenshot of the error]
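To narrow down where the failure happens, a quick sanity check can confirm whether step 4 ever produced a loadable file. This is a minimal sketch, assuming the safetensors package is installed and the output path from the command above:

    # Sketch: verify the GPTQ output from step 4 is a readable safetensors file.
    from safetensors.torch import load_file

    state_dict = load_file("./llama-7b-4bit-gs128.safetensors")
    # Print a few entries to check that quantized weights (qweight/scales) are present.
    for name, tensor in list(state_dict.items())[:5]:
        print(name, tuple(tensor.shape), tensor.dtype)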

Expected behavior

Step 4 should complete successfully.

actual behavior

Failure.

additional notes

none

whk6688 commented 3 months ago

is there another way to do it?

whk6688 commented 3 months ago

Oh, the first error is:
[screenshot of the error]

whk6688 commented 3 months ago

Quantizing the model another way doesn't work either (is int4 not supported?):

python convert_checkpoint.py --model_dir /home/wanghaikuan/code/LLaMA-Factory/llama2-hf-lam \
    --output_dir ./tllm_checkpoint_1gpu_fp16_wq \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4
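If the conversion does complete, one way to confirm what was written is to inspect the config.json that convert_checkpoint.py emits into the checkpoint directory. A minimal sketch, assuming the --output_dir above exists; the exact layout of config.json can vary between TensorRT-LLM releases:

    # Sketch: inspect the converted checkpoint's config for the recorded
    # quantization settings.
    import json

    with open("./tllm_checkpoint_1gpu_fp16_wq/config.json") as f:
        config = json.load(f)

    # For int4 weight-only, a W4A16-style quantization entry is expected.
    print(config.get("quantization"))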

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.