Open jiqing-feng opened 1 week ago
Version 1.2 with IPEX should be released within the next 24 hours, after I merge some PR changes that will affect and simplify the core API for end users when loading and saving models. v1.2 should be stable enough for us to move forward with the optimum PR.
Great, I only left some minor fixes for examples; please merge #540. Please let me know when the stable version is ready. Thanks!
v1.2.1 released. We now need to map out which code/features in optimum and transformers depend on the old auto-gptq so we can create a to-do list and check off each one.
The core function is here huggingface/optimum/blob/main/optimum/gptq/quantizer.py. The others are mostly lib checks or guidance in readme or code comments.
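One way to build that to-do list is to scan a local checkout for references to the old package. The sketch below is only an illustration: the function names, the file-extension filter, and the checklist format are assumptions, not part of optimum, transformers, or gptqmodel.

```python
import re
from pathlib import Path

# Matches "auto_gptq", "auto-gptq" and "AutoGPTQ" (case-insensitive).
PATTERN = re.compile(r"auto[-_]?gptq", re.IGNORECASE)

# Assumption: only these text file types are worth scanning.
TEXT_SUFFIXES = {".py", ".md", ".txt", ".toml", ".cfg"}


def find_auto_gptq_references(root):
    """Return (relative path, line number, stripped line) for each match under root."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip unreadable files instead of aborting the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


def as_checklist(hits):
    """Format the hits as a task list that can be pasted into an issue."""
    return "\n".join(f"- [ ] {path}:{lineno}: {line}" for path, lineno, line in hits)
```

Running `print(as_checklist(find_auto_gptq_references("path/to/optimum")))` on a local clone (the path is a placeholder) would give one checkbox per reference.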
@jiqing-feng transformers calls optimum so we need to PR both at the same time.
We have another issue: the HF GPTQ loading code in from_pretrained and the way GPTQConfig is used are, in my view, very detached from reality and quite messy. From the perspective of a dev and user who does both quantization and loading of quantized models, the current code in transformers doesn't make much sense in how it uses GPTQConfig, which does strange config merges. Once a model is quantized, there is no reason, nor is it possible, to override the model quantization config other than to select the backend kernel.
We are looking at this right now and planning which code we need to change first in gptqmodel so we can adapt to any changes in transformers/optimum.
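To make the "only the backend is selectable at load time" idea concrete, here is a minimal sketch. It is not the transformers or gptqmodel implementation; `resolve_load_config`, `LOAD_TIME_OVERRIDABLE`, and the `backend` key are hypothetical names used only for illustration.

```python
# Hypothetical: the only key a caller may change when loading an
# already-quantized model.
LOAD_TIME_OVERRIDABLE = frozenset({"backend"})


def resolve_load_config(saved, overrides):
    """Combine the config saved with a quantized model and load-time overrides.

    saved:     dict read from the checkpoint's quantization config
    overrides: dict supplied by the user at load time
    """
    # Any attempt to change a persisted attribute (bits, group_size, ...)
    # is rejected instead of being silently merged.
    conflicts = sorted(
        key
        for key, value in overrides.items()
        if key not in LOAD_TIME_OVERRIDABLE and saved.get(key) != value
    )
    if conflicts:
        raise ValueError(
            f"cannot override persisted quantization attributes at load time: {conflicts}"
        )
    resolved = dict(saved)
    for key in LOAD_TIME_OVERRIDABLE:
        if key in overrides:
            resolved[key] = overrides[key]
    return resolved
```

With this rule, `resolve_load_config({"bits": 4, "group_size": 128}, {"backend": "ipex"})` succeeds, while passing `{"bits": 8}` raises a `ValueError`.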
Actually, for ipex, we definitely need to rewrite the quantization config so we can use our IPEX API. The IPEX API adapted the original GPTQ weight format even if you quantize the model in the cuda backend.
If it is not easy to understand, we can discuss it in a Teams meeting if that is convenient for you; just give me your email and your available time slot.
We have identified a problem in HF transformers, and in our gptqmodel code too: there is no separation between the temporary attributes used only for the quantization process and the persistent attributes of the quantized model.
For example, damp is an ephemeral attribute that only exists in the quantization stage and should not persist in the config post-quantization, or should only exist in the meta attribute, if at all. bits and group_size are persistent attributes that are both quantization process attributes and quantized model attributes (used for loading and dequant). The analogy is batch in model training: the batch attribute is not saved, nor should it be, and it is not used in the saving/loading of quantized/trained models.
I plan to address this part in our PRs.
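A minimal sketch of that separation follows. The class name, the default values, and the `to_saved_config` helper are assumptions for illustration only; they do not reflect the current gptqmodel or transformers code.

```python
from dataclasses import asdict, dataclass


@dataclass
class QuantizeConfig:
    # Persistent: needed both to quantize and to load/dequantize.
    bits: int = 4
    group_size: int = 128
    # Ephemeral: only meaningful while quantizing (illustrative default).
    damp: float = 0.01


# Assumption: the fields that describe the quantized model itself.
PERSISTENT_FIELDS = ("bits", "group_size")


def to_saved_config(cfg, keep_meta=False):
    """Return the dict that would be written next to the quantized weights."""
    all_fields = asdict(cfg)
    saved = {name: all_fields[name] for name in PERSISTENT_FIELDS}
    if keep_meta:
        # Ephemeral values are kept only for provenance, under "meta",
        # and are never read back when loading the model.
        saved["meta"] = {
            name: value
            for name, value in all_fields.items()
            if name not in PERSISTENT_FIELDS
        }
    return saved
```

Here `to_saved_config(QuantizeConfig())` contains only `bits` and `group_size`, and `damp` appears solely under `meta` when `keep_meta=True`.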
Actually, for ipex, we definitely need to rewrite the quantization config so we can use our IPEX API. The IPEX API adapted the original GPTQ weight format even if you quantize the model in the cuda backend.
Can you give me a code example of where IPEX would need to alter the persistent quantized config attributes post-quantization (as it relates to the quantization_config that persists in the json file or config.json)? One example will help a lot to see where IPEX's use case is coming from. Thanks.
If it is not easy to understand, we can discuss it in a Teams meeting if that is convenient for you; just give me your email and your available time slot.
You can email me at qubitium@modelcloud.ai and my time is pretty flexible.
I will take AWQ as an example because it's already integrated into transformers. Please install transformers and AutoAWQ from the main repo, and run the following script on an Intel Xeon CPU. If you don't have such a device, I will show this case in our meeting, maybe 2pm in Beijing time tomorrow (11/15)?
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AwqConfig
model_id = "PrunaAI/JackFram-llama-68m-AWQ-4bit-smashed"
text = ["I am happy because", "This is"]
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
input_ids = tokenizer(text, return_tensors="pt", padding=True)
quantization_config = AwqConfig(version="ipex")
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu", quantization_config=quantization_config)
model.generation_config.cache_implementation = "static"
model.generate(**input_ids)
quantization_config = AwqConfig(version="ipex")
And the GPTQModel equivalent, if we change the transformers code, would be GPTQConfig(backend="ipex"). This is what IPEX needs, right? A way to pass in a backend selector?
run the following script on an Intel Xeon CPU
We only have consumer Intel 13th gen, and EPYC 7003 (Zen3) and 7950X (Zen4 desktop); both have AVX512. Which Intel instructions does IPEX require?
maybe 2pm in Beijing time tomorrow (11/15)
Sure. Please email me your contact details and we can take it from there.
Going to list the issues/diffs that we found here: (Will update as more are found)
REF: First PR that AutoGPTQ was partially merged into optimum: https://github.com/huggingface/optimum/pull/1216
{exllama_config: 1}. use_triton is always False. Optimum defaults to cuda + exllama v1. GPTQModel has deprecated the cuda and exllama v1 kernels and only uses triton_v2 for the quantization stage.
Auto**.from_pretrained will automatically start quantization if GPTQConfig is passed in.
AutoGPTQ has: Cuda/Packer, Triton v1/Packer, Triton v2/Packer, Exllama v1/Packer, Exllama v2/(no packer), Marlin/(Marlin packer)
GPTQModel has: Triton v2/Packer, Exllama v2, NM Marlin/(Marlin Packer)
Need to retest cuda vs triton v2 to see which is faster for quant and pack, including with torch.compile() in torch 2.5.1, since we need to re-add this kernel for hf/optimum compat. Unsure whether they will accept another dependency on triton.
History:
Cuda kernel: With the torch 2.5.1 changes, it may be faster than or as fast as Triton v2. Again, we need to test this now since optimum relies on the cuda kernel by default.
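For the cuda vs triton v2 retest, a small timing harness along these lines could be used. This is a generic sketch, not gptqmodel code: `benchmark` and `compare` are hypothetical helpers, and the callables passed in would be whatever quant/pack entry points are being compared.

```python
import time


def benchmark(fn, *args, repeats=5, warmup=1):
    """Return the best wall-clock time in seconds over `repeats` calls of fn(*args).

    For GPU kernels, fn itself must block until the device has finished
    (for example by calling torch.cuda.synchronize() before returning);
    otherwise only the kernel launch is timed.
    """
    for _ in range(warmup):
        fn(*args)  # warm-up runs absorb one-time costs such as compilation
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def compare(candidates, *args, repeats=5, warmup=1):
    """Time every named candidate and return (name, seconds) pairs, fastest first."""
    results = {
        name: benchmark(fn, *args, repeats=repeats, warmup=warmup)
        for name, fn in candidates.items()
    }
    return sorted(results.items(), key=lambda item: item[1])
```

Calling `compare({"cuda": pack_cuda, "triton_v2": pack_triton_v2}, layer)` would rank the two paths, where `pack_cuda`, `pack_triton_v2`, and `layer` are placeholders for the real functions and input.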
Hi @Qubitium. Since the CPU path is already in gptqmodel, when do you plan to replace auto_gptq with gptqmodel in HuggingFace/optimum? I think we can open an issue in Optimum to let the maintainers know as early as possible.
Please let me know if there is anything I can do to help move toward this goal. Thx.