generalsvr opened this issue 1 year ago
8-bit weight-only quantization only supports LLaMA models for now.
Any examples?
parser.add_argument("--mode", type=str, default=[], nargs='+',
help="Model mode: [int8kv] [int8weight | int4weight]")
As for the model file format, we have not tested GGML/GGUF so far. What is the motivation for using these formats?
Will GPTQ be supported?
@XHPlus There are a lot of open-source models on HuggingFace, driven by https://huggingface.co/TheBloke. Many people in the open-source community use those quantized models on TGI / vLLM.
parser.add_argument("--mode", type=str, default=[], nargs='+', help="Model mode: [int8kv] [int8weight | int4weight]")
Using this option with Llama2-13B gives this error:
_get_exception_class.<locals>.Derived: 'LlamaTransformerLayerWeightQuantized' object has no attribute 'quantize_weight'
I tried both --mode int8kv int4weight
and --mode int8kv int4weight
Any suggestions on how to fix this?
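As background on what a quantize_weight step is generally responsible for, below is a minimal, hypothetical sketch of per-channel symmetric int8 weight-only quantization, the technique the int8weight mode name refers to. The function names and shapes are made up for illustration and are not the project's implementation:

import torch

def quantize_weight_int8(weight: torch.Tensor):
    # Hypothetical helper: per-output-channel symmetric int8 weight-only quantization.
    # weight: float tensor of shape (out_features, in_features).
    # One scale per output row so that the largest magnitude maps to 127.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_weight_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # At matmul time the int8 weights are rescaled back to float
    # (or the scale is fused into the kernel).
    return q.to(scale.dtype) * scale

# Quick check: quantize a random weight matrix and look at the reconstruction error.
w = torch.randn(1024, 1024)
q, s = quantize_weight_int8(w)
err = (w - dequantize_weight_int8(q, s)).abs().max()
print(err)  # small, on the order of one quantization step per channel

int4 weight-only quantization follows the same idea with a 4-bit range (usually with group-wise scales), which is why the two appear as alternatives in the --mode help string.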
@XHPlus Quantization is often the only way to run bigger models on smaller GPUs, e.g. Mixtral. With vLLM, I can run Mixtral quantized with 48 GB of VRAM; the unquantized model would use around 100 GB of VRAM, I guess.
How can I use 8-bit quantized models? Can I run GGML/GGUF models?