OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

How to run inference in llama.cpp? #3

Closed: lucasjinreal closed this issue 8 months ago

lucasjinreal commented 1 year ago

How can I run inference in llama.cpp?

ChenMnZ commented 1 year ago

I have yet to integrate llama.cpp with OmniQuant. Notably, OmniQuant has not introduced any additional operations or parameters. As a result, its integration into llama.cpp using the same quantization setup (bit count, symmetry, group size, computation of scaling factor and zero-point) should be straightforward. The integration can be approached as follows:

  1. Confirm that the quantization setup of OmniQuant aligns with that in llama.cpp.
  2. Archive the fake quantization results from OmniQuant using the --save_dir parameter.
  3. Implement the fake quantization models with the appropriate quantization setup in llama.cpp:

    # Convert the OmniQuant fake quantization model to ggml FP16 format
    python3 convert.py models/OmniQuant_model/

    # Quantize the model with the corresponding quantization setup
    ./quantize ./models/OmniQuant_model/ggml-model-f16.gguf ./models/OmniQuant_model/ggml-model-q4_0.gguf quantization_setup

    # Run the inference
    ./main -m ./models/OmniQuant_model/ggml-model-quantization_setup.gguf -n 128

The primary challenge will likely be aligning the quantization setup of OmniQuant with that of llama.cpp.
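
To make that alignment concrete, here is a minimal NumPy sketch of per-group asymmetric min-max quantization (a scale and a zero-point per group). It is not OmniQuant's or llama.cpp's actual code; the function name and defaults are illustrative, and it only shows the kind of setup (bit count, symmetry, group size) that has to match on both sides.

    import numpy as np

    def fake_quantize_groupwise(w, n_bits=4, group_size=128):
        # Per-group asymmetric min-max quantization: each group of
        # `group_size` weights gets its own scaling factor and zero-point.
        qmax = 2 ** n_bits - 1
        w = w.reshape(-1, group_size)
        wmin = w.min(axis=1, keepdims=True)
        wmax = w.max(axis=1, keepdims=True)
        scale = np.maximum((wmax - wmin) / qmax, 1e-8)   # avoid division by zero
        zero_point = np.round(-wmin / scale)
        q = np.clip(np.round(w / scale) + zero_point, 0, qmax)
        return ((q - zero_point) * scale).reshape(-1)    # dequantized "fake quant" weights

    # Example: 4 bits, group size 128. llama.cpp's group-wise formats use
    # different parameters (e.g. 32-element blocks, and Q4_0 is symmetric),
    # which is exactly the kind of mismatch that has to be reconciled.
    w = np.random.randn(4096).astype(np.float32)
    w_fq = fake_quantize_groupwise(w, n_bits=4, group_size=128)
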
johndpope commented 1 year ago

Superficially: does OmniQuant mean the current 70B models can fit on a mobile device, and does that also mean the larger 180B could fit on a 4090 card? Or am I missing something?

ChenMnZ commented 1 year ago

> Superficially: does OmniQuant mean the current 70B models can fit on a mobile device, and does that also mean the larger 180B could fit on a 4090 card? Or am I missing something?

We deploy only the 7B and 13B models onto mobile phones. Refer to Table 3 in our paper for the detailed memory requirements of the quantized models. A 4090 card with 24 GB of memory should be capable of loading a 65B model with INT2 quantization.
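
As a rough sanity check on that claim (a back-of-the-envelope estimate, not a number from the paper; Table 3 reports the measured footprints):

    # Weight memory for a 65B model at INT2 (weights only, ignoring activations
    # and the KV cache). The per-group overhead assumes an FP16 scale and a
    # 2-bit zero-point per group of 128, an illustrative guess rather than the
    # paper's exact storage format.
    params = 65e9
    weight_bits = 2
    group_overhead_bits = (16 + 2) / 128
    gib = params * (weight_bits + group_overhead_bits) / 8 / 2**30
    print(f"~{gib:.1f} GiB")   # roughly 16 GiB, which fits in a 24 GB 4090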

lucasjinreal commented 1 year ago

@ChenMnZ How large is the perplexity loss with INT2?

ChenMnZ commented 1 year ago

> @ChenMnZ How large is the perplexity loss with INT2?

The perplexity loss of INT2 is significantly higher compared to INT3 and INT4. Please consult Table 1 in our paper for further details. We have examined the trade-off between perplexity and total model bits across various quantization bit-widths in Figure A3. Our findings show that INT3 and INT4 yield comparable trade-off curves. However, INT2 is still under development and needs further optimization, although OmniQuant already improves it significantly.
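
For intuition about the "total model bits" axis of that trade-off: with group-wise quantization, each group of g weights stores g low-bit values plus its own scale (and, if asymmetric, a zero-point), so a rough accounting with an FP16 scale is

    effective bits per weight ≈ w_bits + (16 + zero_point_bits) / g

which gives roughly 4.1 bits for W4 at g = 128 versus roughly 2.1 bits for W2. This is a generic estimate and may differ in detail from how Figure A3 counts bits.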

lucasjinreal commented 1 year ago

@ChenMnZ Thank you. Is the W4 quantization the same as in GPTQ? How does its strength compare?

ChenMnZ commented 1 year ago

For W4A16g128 quantization, OmniQuant performs about the same as or slightly better than GPTQ. For W4A16 quantization, OmniQuant outperforms GPTQ.

lucasjinreal commented 1 year ago

@ChenMnZ Thanks. Have you ever run OmniQuant on Chinese large models, such as a Chinese fine-tuned CodeLlama-34B?

I found one such model really interesting: almost all quantization methods fail on it. It would be very good to have a demo on a Chinese 34B model: https://huggingface.co/OpenBuddy/openbuddy-coder-34b-v11-bf16

ChenMnZ commented 1 year ago

Not yet. But thanks for your suggestion, we will give it a try.

Faolain commented 1 month ago

Did you ever get around to looking into how to integrate with llama.cpp?