This repository contains the code for the paper GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers. The current release includes the following features:
- `gptq.py`
- `opt.py`, `bloom.py`, `gptj.py`, `zeroShot/`
- `opt.py`, `bloom.py`, `gptj.py`
- `zeroShot/`
- `quant_cuda_kernel.cu`, `quant_cuda.cpp`, `setup_cuda.py`
- `test_kernel.py`, `opt.py`
- `torch`: tested on v1.10.1+cu111
- `transformers`: tested on v4.21.2
- `datasets`: tested on v1.17.0

All experiments were run on a single 80GB NVIDIA A100. However, most experiments will also work on a GPU with considerably less memory.
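To compare an existing environment against the tested versions listed above, a small check along the following lines may help. This is only a sketch: the helper names `installed_versions` and `report` are made up for illustration and are not part of this repository, and a version mismatch does not necessarily mean the code will fail.

```python
from importlib import metadata

# Versions the README reports as tested; other versions may also work.
TESTED = {
    "torch": "1.10.1+cu111",
    "transformers": "4.21.2",
    "datasets": "1.17.0",
}


def installed_versions(packages):
    """Return {package: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(tested=TESTED):
    """Build one human-readable line per package (hypothetical helper)."""
    lines = []
    for name, have in installed_versions(tested).items():
        want = tested[name]
        if have is None:
            status = "not installed"
        elif have == want:
            status = "matches tested version"
        else:
            status = "differs from tested version " + want
        lines.append(f"{name}: {have} ({status})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```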
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
git clone https://github.com/AlpinDale/gptq-gptj && cd gptq-gptj
pip install -r requirements.txt
# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python gptj.py EleutherAI/gpt-j-6b c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python gptj.py EleutherAI/gpt-j-6b c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python gptj.py EleutherAI/gpt-j-6b c4 --wbits 4 [--groupsize 1024]
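For intuition about what the `--nearest` (RTN) baseline and the `--groupsize` option refer to, here is a minimal pure-Python sketch of round-to-nearest quantization on a uniform grid, with one scale and zero point per group of weights. It is an illustration under stated assumptions (an asymmetric min-max grid, a flat list of weights), not the quantizer used in this repository, and the function name `rtn_quantize` is hypothetical. GPTQ itself goes beyond this independent rounding; see `gptq.py` and the paper for the actual procedure.

```python
def rtn_quantize(weights, wbits=4, groupsize=None):
    """Round-to-nearest quantization of a flat list of floats.

    Each group of `groupsize` consecutive weights gets its own scale and
    zero point (asymmetric min-max grid, an assumption of this sketch).
    groupsize=None uses a single group for the whole list.
    Returns the dequantized (reconstructed) values.
    """
    if groupsize is None:
        groupsize = len(weights)
    maxq = 2 ** wbits - 1  # largest integer level, e.g. 15 for 4 bits
    out = []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        lo, hi = min(group), max(group)
        # Guard against a constant group, where the range is zero.
        scale = (hi - lo) / maxq if hi > lo else 1.0
        for w in group:
            q = round((w - lo) / scale)   # nearest grid index
            q = max(0, min(maxq, q))      # clamp to [0, maxq]
            out.append(q * scale + lo)    # map back to a float
    return out


if __name__ == "__main__":
    w = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
    for gs in (None, 4):
        deq = rtn_quantize(w, wbits=2, groupsize=gs)
        worst = max(abs(a - b) for a, b in zip(w, deq))
        print("groupsize", gs, "max abs error", worst)
```

In this toy input the two halves of the list live on very different ranges, so a separate grid per group of 4 reconstructs the values far more closely than one shared grid. That is the general motivation for smaller group sizes, at the cost of storing more per-group metadata.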
# Install kernels
python setup_cuda.py install
# Benchmark performance of the FC2 layer of GPT-J
CUDA_VISIBLE_DEVICES=0 python test_kernel.py
# Save compressed model
CUDA_VISIBLE_DEVICES=0 python gptj.py EleutherAI/gpt-j-6b c4 --wbits 4 --groupsize 128 --save gpt-j-6b-4bit.pt
# (Optionally) save as compressed `.safetensors` model
CUDA_VISIBLE_DEVICES=0 python gptj.py EleutherAI/gpt-j-6b c4 --wbits 4 --groupsize 128 --save_safetensors gpt-j-6b-4bit.safetensors
# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python gptj.py EleutherAI/gpt-j-6b c4 --wbits 4 --groupsize 128 --load gpt-j-6b-4bit.pt --benchmark 2048 --check
# Benchmark the FP16 baseline. The model will be split across all listed GPUs; use only `0` if you have a single GPU
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python gptj.py EleutherAI/gpt-j-6b c4 --benchmark 2048 --check
# Inference with the saved model
CUDA_VISIBLE_DEVICES=0 python gptj-inference.py EleutherAI/gpt-j-6b --wbits 4 --groupsize 128 --load gpt-j-6b-4bit.pt --text "Hello Pygmalion!"
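The `--wbits` and `--groupsize` values used in the save commands above trade accuracy against storage, since every group carries its own quantization metadata. The following back-of-the-envelope sketch estimates the average bits stored per weight. The storage widths (a 16-bit scale and a `wbits`-wide zero point per group) and both function names are assumptions made for illustration; they are not read from this repository's checkpoint format, and layers that are left unquantized are ignored.

```python
def effective_bits_per_weight(wbits, groupsize, scale_bits=16, zero_bits=None):
    """Average storage bits per weight, including per-group metadata.

    Assumes every group of `groupsize` weights stores one scale
    (`scale_bits`) and one zero point (`zero_bits`, defaulting to `wbits`).
    These widths are illustrative assumptions, not the repository's format.
    """
    if zero_bits is None:
        zero_bits = wbits
    return wbits + (scale_bits + zero_bits) / groupsize


def estimated_size_gib(num_weights, wbits, groupsize, **kwargs):
    """Rough size in GiB of `num_weights` quantized weights."""
    bits = num_weights * effective_bits_per_weight(wbits, groupsize, **kwargs)
    return bits / 8 / 2 ** 30


if __name__ == "__main__":
    # Placeholder weight count for illustration only; substitute the
    # number of weights actually quantized in your model.
    n = 6_000_000_000
    for gs in (128, 1024):
        print("groupsize", gs,
              "bits/weight", effective_bits_per_weight(4, gs),
              "approx GiB", estimated_size_gib(n, 4, gs))
```

Under these assumptions, moving from a group size of 1024 to 128 adds only a fraction of a bit per weight, which is why smaller groups are often an inexpensive way to recover accuracy.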
Not implemented for GPT-J yet.
See the `zeroShot/` folder.
If you found this work useful, please consider citing:
@article{frantar-gptq,
title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
year={2022},
journal={arXiv preprint arXiv:2210.17323}
}