ModelTC / lightllm

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Apache License 2.0

Quantization support #163

Open generalsvr opened 1 year ago

generalsvr commented 1 year ago

How do I use 8-bit quantized models? Can I run GGML/GGUF models?

hiworldwzj commented 1 year ago

8-bit weight-only quantization only supports LLaMA for now.

generalsvr commented 1 year ago

Any examples?

hiworldwzj commented 1 year ago

parser.add_argument("--mode", type=str, default=[], nargs='+',
                    help="Model mode: [int8kv] [int8weight | int4weight]")
XHPlus commented 1 year ago

As for model file formats, we have not tested GGML/GGUF so far. What is the motivation for using these formats?

JustinLin610 commented 1 year ago

Will GPTQ be supported?

suhjohn commented 12 months ago

@XHPlus There are a lot of open-source models on Hugging Face, driven by https://huggingface.co/TheBloke. Many people in the open-source community run those quantized models on TGI / vLLM.

adi commented 9 months ago

parser.add_argument("--mode", type=str, default=[], nargs='+',
                    help="Model mode: [int8kv] [int8weight | int4weight]")

Using this option with Llama2-13B gives this error:

_get_exception_class.<locals>.Derived: 'LlamaTransformerLayerWeightQuantized' object has no attribute 'quantize_weight'

I tried both --mode int8kv int8weight and --mode int8kv int4weight.

Any suggestions on how to fix this?

VfBfoerst commented 8 months ago

@XHPlus Quantization is sometimes the only way to run bigger models on smaller GPUs, e.g. Mixtral. With vLLM, I can run Mixtral quantized within 48 GB of VRAM; I'd guess the unquantized model would need up to 100 GB of VRAM.
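
For reference, a sketch of that kind of vLLM launch (the quantized repo name, the GPU split, and the exact flag values are assumptions, not something verified in this thread):

# Hypothetical vLLM launch of a GPTQ-quantized Mixtral on ~48 GB of VRAM
# (e.g. 2 x 24 GB GPUs). The Hugging Face repo name is assumed to be one of
# TheBloke's quantized uploads.
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
    --quantization gptq \
    --tensor-parallel-size 2 \
    --max-model-len 8192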