ModelCloud / GPTQModel

Production ready LLM model compression/quantization toolkit with accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.
Apache License 2.0

Replace auto_gptq by gptqmodel in HuggingFace/Optimum #536

Open jiqing-feng opened 1 week ago

jiqing-feng commented 1 week ago

Hi @Qubitium. Since the CPU path is already in gptqmodel, when do you plan to replace auto_gptq with gptqmodel in HuggingFace/optimum? I think we can open an issue in Optimum to let the maintainers know as early as possible.

Please let me know if there is anything I can do to move this forward. Thanks.

Qubitium commented 1 week ago

Version 1.2 with ipex should be released within the next 24 hours, after I merge some PR changes that will affect/simplify the core API for end users when loading and saving models. v1.2 should be stable enough for us to move forward with the optimum PR.

jiqing-feng commented 1 week ago

> Version 1.2 with ipex should be released within the next 24 hours, after I merge some PR changes that will affect/simplify the core API for end users when loading and saving models. v1.2 should be stable enough for us to move forward with the optimum PR.

Great, I only have some minor fixes left for the examples; please merge #540. Please let me know when the stable version is ready. Thanks!

Qubitium commented 1 day ago

v1.2.1 released. We now need to map out which code/features in optimum and transformers depend on the old auto-gptq so we can create a to-do list and check off each one.

jiqing-feng commented 18 hours ago

The core function is here huggingface/optimum/blob/main/optimum/gptq/quantizer.py. The others are mostly lib checks or guidance in readme or code comments.

Qubitium commented 12 hours ago

@jiqing-feng transformers calls optimum so we need to PR both at the same time.

We have another issue: the hf gptq loading code in from_pretrained, and how it uses GPTQConfig, is very detached from reality in my view and quite messy. From the perspective of a dev and user who does both quantization and loading of quants, the current code in transformers doesn't make much sense in how it uses GPTQConfig, which does strange config merges. Once a model is quantized, there is no reason, nor is it possible, to override the model quantization config other than to select the backend kernel.

We are looking at this right now and planning which code we need to change first in gptqmodel so we can adapt to any changes in transformers/optimum.
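
A minimal sketch of the load-time rule described above, assuming the only attribute a user may set when loading an already-quantized model is the backend. The names `LOAD_TIME_OVERRIDABLE` and `resolve_load_config` are hypothetical and are not part of transformers, optimum, or gptqmodel:

```python
# Hypothetical illustration only: these names do not exist in
# transformers, optimum, or gptqmodel.

# Attributes a user may set when loading an already-quantized model.
LOAD_TIME_OVERRIDABLE = {"backend"}


def resolve_load_config(saved: dict, user_overrides: dict) -> dict:
    """Combine the persisted quantization config with load-time options.

    Anything other than the backend selector is rejected, because the
    weights were already packed with the persisted bits/group_size.
    """
    illegal = set(user_overrides) - LOAD_TIME_OVERRIDABLE
    if illegal:
        raise ValueError(
            f"cannot override persisted quantization attributes: {sorted(illegal)}"
        )
    resolved = dict(saved)           # copy so the saved config is not mutated
    resolved.update(user_overrides)  # only 'backend' can reach this point
    return resolved


saved_config = {"bits": 4, "group_size": 128}
resolved = resolve_load_config(saved_config, {"backend": "ipex"})
```

With this shape there is nothing to merge: the persisted values are taken as-is and the backend is the single knob left to the caller.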

jiqing-feng commented 12 hours ago

> @jiqing-feng transformers calls optimum so we need to PR both at the same time.
>
> We have another issue: the hf gptq loading code in from_pretrained, and how it uses GPTQConfig, is very detached from reality in my view and quite messy. From the perspective of a dev and user who does both quantization and loading of quants, the current code in transformers doesn't make much sense in how it uses GPTQConfig, which does strange config merges. Once a model is quantized, there is no reason, nor is it possible, to override the model quantization config other than to select the backend kernel.
>
> We are looking at this right now and planning which code we need to change first in gptqmodel so we can adapt to any changes in transformers/optimum.

Actually, for ipex, we definitely need to rewrite the quantization config so we can use our IPEX API. The IPEX API adapted the original GPTQ weight format even if you quantize the model in the cuda backend.

If this is not easy to follow, we can discuss it in a Teams meeting if that is convenient for you. Please send me your email and your available time slots.

Qubitium commented 12 hours ago

We have identified a problem with hf transformers, and with our gptqmodel code too: there is no separation between the temporary attributes used only during the quantization process and the persistent attributes of the quantized model.

For example, damp is an ephemeral attribute that only exists in the quantization stage and should not persist in the config post-quantization, or should only exist in the meta attribute, if at all. bits and group_size are persistent attributes that are both quantization process attributes and quantized model attributes (used for loading and dequant). The analogy is the batch attribute in model training: it is not saved, nor should it be, and it is not used when saving or loading trained models.

I plan to address this part in our PRs.
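
A rough sketch of that separation, under the assumption that only bits and group_size are persisted and that damp is either dropped or parked under a meta key. The class `QuantizeConfigSketch`, the constant `PERSISTENT_FIELDS`, the method `to_persisted_dict`, and the default values are all made up for illustration and do not reflect the actual gptqmodel or transformers config classes:

```python
from dataclasses import asdict, dataclass

# Hypothetical illustration only; not the gptqmodel or transformers API.
# Attributes needed to load and dequantize an already-quantized model.
PERSISTENT_FIELDS = ("bits", "group_size")


@dataclass
class QuantizeConfigSketch:
    # Persistent: used during quantization AND when loading the result.
    bits: int = 4
    group_size: int = 128
    # Ephemeral: only meaningful while the quantization process runs.
    damp: float = 0.01

    def to_persisted_dict(self, keep_meta: bool = False) -> dict:
        """Return what would be written to the saved quantization config."""
        data = asdict(self)
        persisted = {name: data[name] for name in PERSISTENT_FIELDS}
        if keep_meta:
            # Ephemeral values are kept for provenance only, never for loading.
            persisted["meta"] = {
                name: value
                for name, value in data.items()
                if name not in PERSISTENT_FIELDS
            }
        return persisted


cfg = QuantizeConfigSketch(bits=4, group_size=128, damp=0.05)
saved = cfg.to_persisted_dict()
saved_with_meta = cfg.to_persisted_dict(keep_meta=True)
```

Here `saved` carries only what a loader needs, while `saved_with_meta` additionally records damp under meta without making it part of the loadable config.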

> Actually, for ipex, we definitely need to rewrite the quantization config so we can use our IPEX API. The IPEX API adapted the original GPTQ weight format even if you quantize the model in the cuda backend.

Can you give me a code example of where IPEX would need to alter the persistent quantized config attributes post-quantization (as it relates to the quantization_config that persists in the json file or config.json)? One example will help a lot to see where IPEX's use case is coming from. Thanks.

> If this is not easy to follow, we can discuss it in a Teams meeting if that is convenient for you. Please send me your email and your available time slots.

You can email me at qubitium@modelcloud.ai and my time is pretty flexible.

jiqing-feng commented 12 hours ago

I will take AWQ as an example because it's already integrated into transformers. Please install transformers and AutoAWQ from the main repo, and run the following script on an Intel Xeon CPU. If you don't have such a device, I will show this case in our meeting, maybe 2pm Beijing time tomorrow (11/15)?

from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "PrunaAI/JackFram-llama-68m-AWQ-4bit-smashed"

# Tokenize a small padded batch of prompts.
text = ["I am happy because", "This is"]
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
input_ids = tokenizer(text, return_tensors="pt", padding=True)

# Select the IPEX kernels at load time; the checkpoint is already AWQ-quantized.
quantization_config = AwqConfig(version="ipex")

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu", quantization_config=quantization_config)
model.generation_config.cache_implementation = "static"
model.generate(**input_ids)

Qubitium commented 11 hours ago

> quantization_config = AwqConfig(version="ipex")

And the GPTQModel equivalent, if we change the transformers code, would be GPTQConfig(backend="ipex"). This is what IPEX needs, right? A way to pass in a backend selector?

> run the following script on an Intel Xeon CPU

We only have consumer Intel 13th gen, EPYC 7003 (Zen 3), and 7950X (Zen 4 desktop); both have AVX512. Which Intel instructions does IPEX require?

> maybe 2pm Beijing time tomorrow (11/15)

Sure. Please email me your contact details and we can take it from there.

jiqing-feng commented 11 hours ago

1. Yes, we need to pass the backend when selecting the quant layer here: optimum/gptq/quantizer.py
2. An Intel CPU with AVX512 should work.
3. I have sent you the invitation; my email is jiqing.feng@intel.com

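
To make point 1 concrete, here is a small sketch of picking the quant layer class from a backend string. Everything in it is hypothetical: the registry, `register_backend`, `select_quant_linear`, and the placeholder classes are invented for illustration and are not how optimum/gptq/quantizer.py or gptqmodel currently select kernels:

```python
from typing import Dict

# Hypothetical illustration only; not the optimum or gptqmodel API.
_QUANT_LINEAR_REGISTRY: Dict[str, type] = {}


def register_backend(name: str):
    """Class decorator that records a quant layer under a backend name."""
    def decorator(cls: type) -> type:
        _QUANT_LINEAR_REGISTRY[name] = cls
        return cls
    return decorator


@register_backend("triton")
class TritonQuantLinearSketch:
    """Placeholder standing in for a Triton-based quantized linear layer."""


@register_backend("ipex")
class IPEXQuantLinearSketch:
    """Placeholder standing in for an IPEX-based quantized linear layer."""


def select_quant_linear(backend: str) -> type:
    """Return the layer class registered for the requested backend."""
    try:
        return _QUANT_LINEAR_REGISTRY[backend]
    except KeyError:
        known = ", ".join(sorted(_QUANT_LINEAR_REGISTRY))
        raise ValueError(
            f"unknown backend {backend!r}; expected one of: {known}"
        ) from None


layer_cls = select_quant_linear("ipex")
```

The point of the sketch is only that the backend is an explicit argument to layer selection rather than something inferred from a merged config.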
Qubitium commented 10 hours ago

Going to list the issues/diffs that we found here (will update as more are found):

REF: First PR in which AutoGPTQ was partially merged into optimum: https://github.com/huggingface/optimum/pull/1216

Kernels:

AutoGPTQ has: Cuda/Packer, Triton v1/Packer, Triton v2/Packer, Exllama v1/Packer, Exllama v2/(no-packer), Marlin/(Marlin packer)

GPTQModel has: Triton v2/Packer, Exllama v2, NM Marlin/(Marlin Packer)

We need to retest cuda vs triton v2 to see which is faster for quant and pack, including with torch.compile() in torch 2.5.1, since we need to re-add this kernel for hf/optimum compat. Unsure whether they will accept another dependency, Triton.

History:

Cuda kernel: With the torch 2.5.1 changes, it may be faster than or as fast as Triton v2. Again, we need to test this now since optimum relies on the cuda kernel by default.