OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Trying to run models following docs; incomplete? #4

Closed: lhl closed this issue 1 year ago

lhl commented 1 year ago

Hi, I'm just trying to follow along and test this out, and I've run into some issues with the instructions:

I follow the instructions

mkdir dist && cd dist

# test Llama-2-7b-chat with w3a16g128 quantization
git clone https://huggingface.co/ChenMnZ/Llama-2-7b-chat-omniquant-w3a16g128asym

Then (this step isn't in the instructions) I cd .. to the folder with your mlc_chat_cli and run:

./mlc_chat_cli --local-id Llama-2-7b-chat-omniquant-w3a16g128asym --device-name cuda

I get this error:

./mlc_chat_cli: error while loading shared libraries: libmlc_llm.so: cannot open shared object file: No such file or directory

I can install my own mlc_chat_cli (mamba install -c mlc-ai -c conda-forge mlc-chat-cli-nightly), but it has very different flags (--model vs --local-id, and --device rather than --device-name), and it's not happy with the cuda.so for some reason:

Loading model...
mlc_chat_cli: symbol lookup error: /home/local/llm/omniquant/OmniQuant/dist/Llama-2-7b-chat-omniquant-w3a16g128asym/Llama-2-7b-chat-omniquant-w3a16g128asym-cuda.so: undefined symbol: __cudaRegisterFatBinary
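
For reference, my understanding is that __cudaRegisterFatBinary normally comes from the CUDA runtime, so I assume the prebuilt .so expects the host mlc_chat_cli (or one of its libraries) to provide it. A quick way to inspect the shared object (just ldd/nm as a generic diagnostic sketch, nothing specific to this repo):

ldd Llama-2-7b-chat-omniquant-w3a16g128asym-cuda.so
nm -D --undefined-only Llama-2-7b-chat-omniquant-w3a16g128asym-cuda.so | grep -i cuda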

So I decided I would see if I could build my own model via the Usage docs: https://github.com/OpenGVLab/OmniQuant#usage

I was able to generate the scales and shifts and do the weight-only quantization (it took about 1.8h for a W3A16g128 quant of a llama2-7b on a 4090) - does that seem right? If you have approximate times for how long the quants take (on an A100 40G, I suppose), that would be useful as well.

I'm at step 4 now; it appears to be going through the quantization process again (it can't use the existing logs to save?), so I'm letting it run, but after that it's still unclear how I should get it working. Am I able to compile an MLC-LLM model in the default way from this "fake quantized" model? https://mlc.ai/mlc-llm/docs/compilation/compile_models.html - do I just skip --quantization entirely for mlc_llm.build?

ChenMnZ commented 1 year ago

Thank you for your report. We may have omitted some environment dependencies of mlc-llm, and we will address this promptly.

Training LLaMa-2-7B with W3A16g128 on an A100-80G takes approximately 1.1 hours. For more information, please refer to Table A1 in our paper, where we provide training time details for LLaMa-7B to LLaMa-65B.

To save a fake quantization model with existing checkpoints, set --epochs to 0 and --resume to the checkpoint path, as shown in the example below:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/Llama-2-7b-chat \
--epochs 0 --output_dir ./temp \
--wbits 3 --abits 16 --group_size 128 --lwc \
--save_dir /PATH/TO/SAVE/llama-7b-omniquant-w3a16g128 \
--resume /PATH/TO/OmniQuant_Checkpoints/LLama-2-7b-chat-w3a16g128.pth

To compile the fake quantized model, add the 'w3a16g128asym' quantization scheme to https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/quantization/__init__.py as follows:

    "w3a16g128asym": QuantizationScheme(
        name="w3a16g128asym",
        linear_weight=GroupQuantizationSpec(
            dtype="float16",
            mode="int3",
            sym=False,
            storage_nbit=16,
            group_size=128,
            transpose=False,
        ),
        embedding_table=None,
        final_fc_weight=None,
    ),
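
For context, this entry goes alongside the built-in schemes in the scheme table of that file. A minimal sketch of the placement (the surrounding dict name and the existing entries shown here are assumptions based on the entry format above, not exact contents of the file):

    # mlc_llm/quantization/__init__.py (placement sketch)
    quantization_schemes = {
        # ... existing built-in schemes such as "q4f16_1" ...
        "w3a16g128asym": QuantizationScheme(
            name="w3a16g128asym",
            linear_weight=GroupQuantizationSpec(
                dtype="float16", mode="int3", sym=False,
                storage_nbit=16, group_size=128, transpose=False,
            ),
            embedding_table=None,
            final_fc_weight=None,
        ),
    }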

Then, follow the instructions at https://mlc.ai/mlc-llm/docs/compilation/compile_models.html to compile the fake quantized model using the command:

python3 build.py --hf-path Llama-2-7b-chat-omniquant --target cuda --quantization w3a16g128asym
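
After the build completes, the compiled model (named <model>-<quantization>, e.g. Llama-2-7b-chat-omniquant-w3a16g128asym) should appear under ./dist/ and can be run with the CLI as in your first attempt, for example (assuming the mlc_chat_cli build that takes --local-id/--device-name; the conda nightly build uses --model/--device instead, as you noted):

./mlc_chat_cli --local-id Llama-2-7b-chat-omniquant-w3a16g128asym --device-name cuda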

We will provide more detailed information about this process in the future.

lhl commented 1 year ago

Thanks @ChenMnZ for the quick reply, the notes you gave were very useful and let me replicate the quants. I've written a full doc of my experience/step-by-step for going through a quantize here: https://llm-tracker.info/books/llms/page/omniquant

One thing that might be worth noting: while the perf for W3A16 sort of sucked in your paper, it actually runs very quickly with my current version of TVM/MLC (nightly CUDA 12.1 pre-packaged TVM, HEAD checkout of MLC). The W3A16 llama2-7b actually ran faster than q4f16_1 - the fastest batch=1 speed I've seen on my 4090. It also used about 4.5% less VRAM. That's great!

So, I saw there's an --eval_ppl flag in the scripts. Is there a way to run the perplexity eval in the same way, without the epoch processing, if I already have logs/a checkpoint available?
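
For example, I'm imagining something like this (just my assumption that --eval_ppl composes with --epochs 0 and --resume; paths are placeholders):

CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/Llama-2-7b-chat \
--epochs 0 --output_dir ./temp \
--wbits 3 --abits 16 --group_size 128 --lwc \
--resume /PATH/TO/OmniQuant_Checkpoints/LLama-2-7b-chat-w3a16g128.pth \
--eval_ppl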