The relevant code is from https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/modeling/_utils.py and shown below.

```python
def pack_model(
    model,
    quantizers,
    bits,
    group_size,
    use_triton=False,
    use_cuda_fp16=True,
    desc_act=False,
    warmup_triton: bool = False,
    force_layer_back_to_cpu: bool = False
):
    ...
    for name in qlayers:
        logger.info(name)
        quantizers[name], scale, zero, g_idx = quantizers[name]
        layer_device = qlayers[name].device
        qlayers[name].to(CPU)
        layers[name], scale, zero, g_idx = layers[name].to(CPU), scale.to(CPU), zero.to(CPU), g_idx.to(CPU)
        qlayers[name].pack(layers[name], scale, zero, g_idx)
```
I used auto_gptq to quantize a large language model whose transformer has 80 layers. I found that each layer needs almost 4 minutes to pack, so I have to wait several hours before the whole packing step finishes. Are there any suggestions for solving this problem? Can the model packing step be sped up?
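
In case it helps the discussion, here is a minimal sketch of how the per-layer pack calls might be dispatched to a thread pool instead of running one after another. This is only an assumption on my side: it presumes each `qlayers[name].pack(...)` call is independent of the others, that most of its time is spent in torch ops that release the GIL, and that the names from the snippet above (`qlayers`, `layers`, `quantizers`, `CPU`) are in scope. The helper `pack_one_layer` is hypothetical and not part of AutoGPTQ's API.

```python
# Hypothetical sketch: run the per-layer pack calls concurrently instead of serially.
# Assumes pack() is safe to call from multiple threads for distinct layers and that
# its heavy work releases the GIL; pack_one_layer is not an AutoGPTQ function.
from concurrent.futures import ThreadPoolExecutor

def pack_one_layer(name):
    # Mirrors the body of the original loop, moving everything to CPU first.
    _, scale, zero, g_idx = quantizers[name]
    qlayers[name].to(CPU)
    layer = layers[name].to(CPU)
    scale, zero, g_idx = scale.to(CPU), zero.to(CPU), g_idx.to(CPU)
    qlayers[name].pack(layer, scale, zero, g_idx)

# Pack several layers at a time; tune max_workers to the number of CPU cores.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(pack_one_layer, qlayers))
```

Whether this actually helps depends on where the time really goes; if `pack()` is dominated by pure-Python loops, profiling it first (or moving to a process pool) would be the better starting point.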