OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

How to add a new model for OmniQuant? #22

Closed. gesanqiu closed this issue 4 months ago.

gesanqiu commented 11 months ago

Thanks for your brilliant work. After exploring the project for several days, I found that OmniQuant is portable to edge devices like Jetson or phones, and I am wondering how I can add more models to OmniQuant. Do you have any tutorials about this? Maybe we can start from CodeLlama, since it has a similar architecture to Llama-2, and Llama-2 is already supported. Also, apologies in advance if this seems to be something obvious, because I'm new to the LLM field.

ChenMnZ commented 11 months ago

If you want to quantize a new model that has the same architecture as a supported model, you can just set --net directly. This is an example command to quantize CodeLlama:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/CodeLLama/CodeLLama-7b \
--epochs 20 --output_dir ./log/llama-7b-w3a16g128 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc \
--net Llama-2-7b
gesanqiu commented 11 months ago

Thanks for your reply. My question is how to add a new model whose architecture is not supported yet.

superdocker commented 10 months ago

You have to add a new file int_{your model}_layer.py in models/, referring to the other files there. Some renaming may be needed for the registered parameters (e.g. o_proj <-> out_proj, c_fc <-> fc2) and for CPU offloading (e.g. model.transformer.h <-> model.decoder.layers) in main.py and omniquant.py.
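For concreteness, here is a minimal, hypothetical skeleton of such a file, loosely modeled on the existing int_llama_layer.py. The QuantLinear import and its argument names are from memory and should be checked against quantize/int_linear.py and the existing model files:

# models/int_mymodel_layer.py -- hypothetical skeleton, not an actual file in this repo
import torch.nn as nn
from quantize.int_linear import QuantLinear  # assumed location/signature; verify in the repo

class QuantMyModelAttention(nn.Module):
    def __init__(self, org_module, args):
        super().__init__()
        # Wrap each nn.Linear of the original attention block with QuantLinear.
        # Note the naming difference vs. Llama: this architecture uses out_proj
        # instead of o_proj, so the parameter registration in omniquant.py must
        # use the same name.
        self.q_proj = QuantLinear(org_module.q_proj, args.weight_quant_params, args.act_quant_params)
        self.k_proj = QuantLinear(org_module.k_proj, args.weight_quant_params, args.act_quant_params)
        self.v_proj = QuantLinear(org_module.v_proj, args.weight_quant_params, args.act_quant_params)
        self.out_proj = QuantLinear(org_module.out_proj, args.weight_quant_params, args.act_quant_params)

The CPU-offloading side is the same idea: wherever main.py walks the decoder blocks, it has to point at the right container for your architecture, e.g. model.transformer.h instead of model.decoder.layers.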

Louym commented 10 months ago

> You have to add a new file int_{your model}_layer.py in models/, referring to the other files there. Some renaming may be needed for the registered parameters (e.g. o_proj <-> out_proj, c_fc <-> fc2) and for CPU offloading (e.g. model.transformer.h <-> model.decoder.layers) in main.py and omniquant.py.

Have you tried to add BLOOM models? I met some problems in issue #29.

superdocker commented 10 months ago

@Louym No, I haven't tried to add BLOOM models. I don't know the details of your implementation (and I'm not a contributor to this repo), and the attached error could have many causes. Anyway, I hope you can solve this problem, and I hope my experience can be of help.

  1. Disable some transformations first. For example, the LayerNorm-Linear transform for BLOOM is already implemented in other repos (e.g. SmoothQuant or AWQ), so if you disable the Query-Key or Value-Output transform (the unique contribution of this repo), you can find the point to debug much more easily; see the first command sketch after this list.
  2. Run inference first to check functionality at higher precision. Because this repo initializes the transform parameters with SmoothQuant, an 8-bit inference run (with no omni parameter updates) should show almost baseline accuracy if your implementation is right. Otherwise, something is wrong in your implementation itself, independent of the optimizer, the computational graph, and the backward pass; see the second command sketch below.
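For point 1, one way to take the equivalent transformations out of the picture entirely is to run with learnable weight clipping only. This reuses the command pattern from earlier in the thread and assumes the --lwc/--let flags behave as described in the README (--let enables the learnable equivalent transformation, so omitting it skips those transforms); /PATH/TO/YourModel and YourModelName are placeholders:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/YourModel \
--epochs 20 --output_dir ./log/debug-lwc-only \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc \
--net YourModelName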
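For point 2, a sketch of the 8-bit sanity check. The assumption here (worth verifying in main.py) is that --epochs 0 skips the block-wise optimization, so the SmoothQuant-initialized transform parameters are applied unchanged and the reported perplexity should land near the FP16 baseline if your new layer file is correct:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/YourModel \
--epochs 0 --output_dir ./log/sanity-w8a8 \
--eval_ppl --wbits 8 --abits 8 --lwc --let \
--net YourModelName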