ModelCloud / GPTQModel

GPTQ-based LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
Apache License 2.0

[COMPAT] HF compat (AutoModel + Optimum) #440

Open Qubitium opened 6 days ago

Qubitium commented 6 days ago

@jiqing-feng I am going to answer gptqmodel specifics here.

By "transformers integration", do you mean AutoModel loading of quantized models? HF transformers moved all quantization code into Optimum, and we have the following integration code via monkey patch:

https://github.com/ModelCloud/GPTQModel/blob/98dc26f04c70393e8da272a83450cf4f14790b79/tests/test_transformers_integration.py#L27

Is this what you are looking for?

ref: https://github.com/AutoGPTQ/AutoGPTQ/pull/737#issuecomment-2415622872
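
Roughly, the user-facing flow that test exercises looks like this. This is only a sketch: the checkpoint id below is a placeholder, and the monkey patch is assumed to already be applied as in the linked test, so that Optimum's GPTQ loading path resolves to GPTQModel kernels instead of AutoGPTQ ones.

```python
# Sketch of AutoModel loading of a GPTQ-quantized checkpoint.
# Assumptions: the GPTQModel monkey patch (see the linked test) has been applied,
# and MODEL_ID is just a placeholder for a real quantized checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ModelCloud/some-gptq-quantized-model"  # placeholder

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("gptqmodel is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```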

jiqing-feng commented 6 days ago

Not only AutoModel; the main blocker is the gptq lib check here. Unless we change the lib check from auto-gptq to gptqmodel, it will always be false when only gptqmodel is installed.

The same check exists in quantizer_gptq.
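
Roughly, the kind of check being referred to looks like this (paraphrased, not the exact transformers/optimum source; the second helper is the sort of thing that would need to be added and accepted upstream):

```python
# Paraphrased sketch of the blocking availability check: detection probes for
# the auto_gptq package only, so an environment with only gptqmodel installed
# still fails the gate in quantizer_gptq-style code.
import importlib.util

def is_auto_gptq_available() -> bool:
    return importlib.util.find_spec("auto_gptq") is not None

def is_gptqmodel_available() -> bool:
    # Hypothetical counterpart that upstream would need to check as well.
    return importlib.util.find_spec("gptqmodel") is not None

# Current-style gate: always fails without auto-gptq, regardless of gptqmodel.
if not is_auto_gptq_available():
    raise ImportError("Loading a GPTQ quantized model requires auto-gptq (`pip install auto-gptq`)")
```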

Qubitium commented 6 days ago

@jiqing-feng Ok, I see the chicken-and-egg problem here. Our integration only tested/patched model loading via Optimum, but the code you referenced is actually HF transformers calling autogptq for model quantization.

jiqing-feng commented 6 days ago

> Not only AutoModel; the main blocker is the gptq lib check here. Unless we change the lib check from auto-gptq to gptqmodel, it will always be false when only gptqmodel is installed.
>
> The same check exists in quantizer_gptq.

Hi @Qubitium. Sorry for misunderstanding your point; I will check the possibility. Thanks!

jiqing-feng commented 6 days ago

Please see optimum/gptq; it also uses the auto_gptq lib, so we can only upstream in AutoGPTQ. The Intel CPU path wants to keep the same usage as CUDA to make it more user-friendly. Thanks for your investigation. I think we can focus on how to upstream to AutoGPTQ.

yao-matrix commented 1 day ago

@Qubitium, do you have any plan to integrate gptqmodel into transformers, like what eetq and autogptq do? Thx.

Qubitium commented 1 day ago

@yao-matrix Yes. This is our goal, but not the ultimate goal. Our primary goals are max model compat (new models), quant model compat with vLLM/SGLang, plus quant speed and quant quality recovery. API backward compat is not our primary goal right now. Once I feel our API is stable, very soon, we will submit PRs to Transformers/Optimum to replace autogptq as much as possible. There are many reasons AutoGPTQ is not getting proper updates, and in my view the problem will only get worse.

jiqing-feng commented 1 day ago

> @yao-matrix Yes. This is our goal, but not the ultimate goal. Our primary goals are max model compat (new models), quant model compat with vLLM/SGLang, plus quant speed and quant quality recovery. API backward compat is not our primary goal right now. Once I feel our API is stable, very soon, we will submit PRs to Transformers/Optimum to replace autogptq as much as possible. There are many reasons AutoGPTQ is not getting proper updates, and in my view the problem will only get worse.

Great, I will add the IPEX feature into GPTQModel. BTW, do you think we could finish the replacement in HF/Optimum by the end of this year? I would like to help with it. Thx!

Qubitium commented 23 hours ago

> BTW, do you think we could finish the replacement in HF/Optimum by the end of this year? I would like to help with it. Thx!

That's great! We welcome contributions from anyone willing to improve this project. We are confident that once you start working within the gptqmodel internals/framework, you will not want to switch back to autogptq for any reason. =)

Definitely we can by the end of 2024. But we are also bound by the review process of these projects. Our lm-eval PR for gptqmodel has been open for about 3 months with no activity or feedback, so it really depends on how fast they react. https://github.com/EleutherAI/lm-evaluation-harness/pull/2217

Qubitium commented 23 hours ago

@jiqing-feng For the IPEX code, please add a small unit test in tests. The CI is not automatic, but we can trigger it manually on our 4090 action-hub instances when the code is ready for verification. Every major feature/kernel will be CI tested/validated for regressions in future releases. We also plan to add a CI test for every single model we support, since regressions in model quantization/inference are highly likely due to HF transformers and tokenizer updates.
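
Something along these lines would work as a starting point. Treat it only as a sketch: the loader call, the device argument, the test file name, and the model id below are assumptions rather than confirmed GPTQModel API, so adapt them to whatever the IPEX path actually exposes.

```python
# tests/test_ipex.py -- rough sketch of a minimal IPEX/CPU unit test.
# Assumptions (not confirmed API): GPTQModel.load() as the loader entry point,
# device="cpu" routing to the IPEX kernel, and MODEL_ID as a placeholder
# quantized checkpoint.
import unittest

from transformers import AutoTokenizer
from gptqmodel import GPTQModel  # assumed import path

MODEL_ID = "ModelCloud/some-small-gptq-quantized-model"  # placeholder


class TestIPEX(unittest.TestCase):
    def test_ipex_cpu_inference(self):
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        # Assumed: loading on CPU selects the IPEX-backed quant linear kernel.
        model = GPTQModel.load(MODEL_ID, device="cpu")

        inputs = tokenizer("gptqmodel is", return_tensors="pt")
        # Assumed: the loaded model proxies HF generate().
        out = model.generate(**inputs, max_new_tokens=8)
        self.assertGreater(len(tokenizer.decode(out[0], skip_special_tokens=True)), 0)


if __name__ == "__main__":
    unittest.main()
```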

jiqing-feng commented 23 hours ago

Hi @Qubitium. Thanks for your support. I will ask you to review once the PR is ready. For lm-eval, I think you can fix the failed test (due to code style) and then let the maintainers review :)