IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
https://arxiv.org/abs/2210.17323
Apache License 2.0

pack_model takes too long #46

Closed · westboy123 closed 8 months ago

westboy123 commented 8 months ago

I used AutoGPTQ to quantize a large language model whose transformer has 80 layers. Each layer needs almost 4 minutes to pack, so I have to wait several hours (80 layers × ~4 minutes ≈ 5 hours) before the whole packing step finishes. Are there better suggestions for solving this problem? Can the model packing step be sped up?

westboy123 commented 8 months ago

The relevant code is from https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/modeling/_utils.py and is shown below.

    def pack_model(
        model,
        quantizers,
        bits,
        group_size,
        use_triton=False,
        use_cuda_fp16=True,
        desc_act=False,
        warmup_triton: bool = False,
        force_layer_back_to_cpu: bool = False,
    ):
        ...
        for name in qlayers:
            logger.info(name)
            quantizers[name], scale, zero, g_idx = quantizers[name]
            # so far can only pack layer on CPU
            layer_device = qlayers[name].device
            qlayers[name].to(CPU)
            layers[name], scale, zero, g_idx = layers[name].to(CPU), scale.to(CPU), zero.to(CPU), g_idx.to(CPU)
            qlayers[name].pack(layers[name], scale, zero, g_idx)
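One possible workaround, since the loop above packs each layer independently on the CPU, is to run the per-layer pack() calls in parallel. The sketch below is not part of AutoGPTQ's API: pack_layers_parallel and pack_one are hypothetical names, and it assumes qlayers, layers, and quantizers are the same name-keyed dicts that pack_model uses. How much this helps depends on whether pack() spends most of its time in torch/numpy kernels (which release the GIL); if it is dominated by pure-Python loops, a process pool would be needed instead of threads.

    import os
    from concurrent.futures import ThreadPoolExecutor

    import torch

    CPU = torch.device("cpu")

    def pack_layers_parallel(qlayers, layers, quantizers, max_workers=None):
        # Hypothetical helper, not AutoGPTQ's API: packs every quantized
        # layer on CPU, spreading the independent per-layer pack() calls
        # across a thread pool instead of a sequential loop.
        def pack_one(name):
            _quantizer, scale, zero, g_idx = quantizers[name]
            qlayers[name].to(CPU)
            layer = layers[name].to(CPU)
            qlayers[name].pack(layer, scale.to(CPU), zero.to(CPU), g_idx.to(CPU))

        with ThreadPoolExecutor(max_workers=max_workers or os.cpu_count()) as pool:
            # consume the iterator so any exception raised in pack_one propagates
            list(pool.map(pack_one, qlayers))

With 80 layers at ~4 minutes each, even modest parallelism (say 8 workers) would cut the wall-clock time from roughly 5 hours toward 40 minutes, assuming the packing work actually scales across cores.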