ModelCloud / GPTQModel

Production ready LLM model compression/quantization toolkit with accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.
Apache License 2.0

[FEATURE] PIP package and robust packing API #26

Closed: wenhuach21 closed this issue 3 months ago

wenhuach21 commented 5 months ago

1. When will you provide a pip package?

2. Automatic backend switching per layer. As far as I know, some backends have specific requirements, for example on bits or channel count.

3. Will you support layer fusing like AWQ does?

Qubitium commented 5 months ago

Hi @wenhuach21

  1. We are still working at this very minute to validate Qwen2Moe support, so the pip release will be ready once all the low-hanging fruit of model support has been added and validated. We also need to re-run all our validation tests to make sure our first public release is a good one with minimal regressions.

  2. Do you mean per-layer bit and group-size optimization? For example, allowing each layer to have different quantization properties so that the quantization code selects the best bits/channels for quality, and then, on load, does the same for inference? Like what exllama v2 and gguf do with layer-specific dynamic quantization properties? (A rough sketch of what such a per-layer config could look like follows after this list.)

  3. Haven't had time to check out AWQ and layer fusing yet. Can you elaborate? Thanks.
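For illustration, a minimal sketch of what a layer-specific quantization config could look like; the class names and the `per_layer`/`for_layer` fields are hypothetical, not an existing GPTQModel API:

```python
from dataclasses import dataclass, field

@dataclass
class LayerQuantConfig:
    bits: int = 4
    group_size: int = 128

@dataclass
class QuantConfig:
    # model-wide defaults
    default: LayerQuantConfig = field(default_factory=LayerQuantConfig)
    # hypothetical per-layer overrides, keyed by a substring of the layer name
    per_layer: dict = field(default_factory=dict)

    def for_layer(self, name: str) -> LayerQuantConfig:
        # first matching pattern wins; otherwise fall back to the model-wide default
        for pattern, cfg in self.per_layer.items():
            if pattern in name:
                return cfg
        return self.default

# example: keep most layers at 4 bits but give mlp.down_proj 8 bits / group size 64
cfg = QuantConfig(per_layer={"mlp.down_proj": LayerQuantConfig(bits=8, group_size=64)})
print(cfg.for_layer("model.layers.0.mlp.down_proj").bits)     # 8
print(cfg.for_layer("model.layers.0.self_attn.q_proj").bits)  # 4
```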

wenhuach21 commented 5 months ago

2. Yes, that's correct. Another scenario to consider is when a backend does not support a specific layer, for example when channels % 32 != 0. In such cases it is better to switch to a different backend for that layer, or for the entire model (see the sketch after this list).

3. https://github.com/casper-hansen/AutoAWQ/tree/main/awq/modules/fused
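To make point 2 concrete, here is a rough sketch of the kind of per-layer backend fallback being suggested; the backend names and the constraint checks are illustrative assumptions, not GPTQModel's actual selection logic:

```python
# Illustrative only: choose a kernel backend per layer from its bits and shape,
# falling back when a backend's constraints (e.g. channels % 32 == 0) are not met.
PREFERRED_BACKENDS = ["marlin", "exllama_v2", "triton", "torch"]

def backend_supports(backend: str, bits: int, in_features: int, out_features: int) -> bool:
    # Assumed constraints for illustration; real kernels define their own rules.
    if backend == "marlin":
        return bits == 4 and in_features % 128 == 0 and out_features % 256 == 0
    if backend == "exllama_v2":
        return bits == 4 and in_features % 32 == 0
    if backend == "triton":
        return in_features % 32 == 0
    return True  # plain torch fallback handles everything

def select_backend(bits: int, in_features: int, out_features: int) -> str:
    for backend in PREFERRED_BACKENDS:
        if backend_supports(backend, bits, in_features, out_features):
            return backend
    return "torch"

print(select_backend(bits=4, in_features=11008, out_features=4096))  # "marlin"
print(select_backend(bits=4, in_features=1000, out_features=1000))   # "torch" (1000 % 32 != 0)
```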

I'm not sure what the target of this repo is. Since AutoGPTQ has been merged into Transformers, users may prefer to use the unified API in Transformers, so providing a comprehensive kernel repository for all the other algorithms would be beneficial.

Qubitium commented 5 months ago

  1. Fused attention is not faster than the Marlin kernel. The old AutoGPTQ repo had fused attention support for some models, and that feature was (a) not applicable to all models, (b) not better than exllama v2 or Marlin, and (c) not even fully tested or validated for the models it claimed to support.

This repo's target is to use the best-performing kernels for inference and remove the rest, just like how we removed Qigen and will replace it with Qbits.

As far as the Transformers integration with AutoGPTQ goes, I can only say it is filled with bugs that we have fixed in this repo. The first 0.9.0 release changelog is incomplete, and we have fixed lots of small usability bugs as well.

wenhuach21 commented 5 months ago

Cool! Is there any publicly available performance data? Considering the many scenarios, such as varying batch sizes, prefill token counts (input tokens), different devices, and generated token counts (output tokens), I could only find some kernel suggestions in the AutoAWQ repository.

Qubitium commented 5 months ago

> Cool! Is there any publicly available performance data? Considering the many scenarios, such as varying batch sizes, prefill token counts (input tokens), different devices, and generated token counts (output tokens), I could only find some kernel suggestions in the AutoAWQ repository.

Very true that performance data has a lot of variance, and on top of that different models behave differently even within the same family (e.g. different vocabulary sizes or layer counts).

I think it is good to use something like Llama 3 as a base benchmark since so many models are architecturally similar to it. Added this to our milestone for July.
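A minimal sketch of the kind of sweep such a benchmark could run, using the plain `transformers` generate API rather than any GPTQModel-specific entry point; the model id, batch sizes, and token counts are placeholders:

```python
import itertools
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # placeholder base model for the benchmark

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="cuda")

prompt = "Benchmarking prompt. " * 512  # long enough to fill the prefill lengths below

for batch_size, prefill_len, new_tokens in itertools.product([1, 4, 16], [128, 1024], [64, 256]):
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt", truncation=True,
                       max_length=prefill_len, padding="max_length").to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"bs={batch_size} prefill={prefill_len} new={new_tokens}: "
          f"{batch_size * new_tokens / elapsed:.1f} output tok/s")
```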

Qubitium commented 3 months ago

Closing issue as the pip package has been uploaded.

wenhuach21 commented 3 months ago

Hi @Qubitium,

I'm exploring potential repositories that AutoRound could leverage as a backend for CUDA, given that we’re unable to release CUDA kernels within the package. I believe there could be an opportunity for collaboration here.

Since your work already supports many backends and we are both going to support mixed bits, it would be useful if you could provide an interface that, based on the quantization configuration, bits, and layer parameters, returns an appropriate wrapper layer. This would allow us to integrate it directly and call it from the AutoRound side.
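Something along these lines is presumably what is being asked for; the factory name, its signature, and the placeholder wrapper classes below are only a suggestion for discussion, not an existing GPTQModel API:

```python
import torch.nn as nn

# Placeholder stand-ins for kernel-specific QuantLinear implementations.
class TorchQuantLinear(nn.Linear):
    """Stand-in for a slow-but-universal fallback wrapper."""

class Marlin4BitLinear(nn.Linear):
    """Stand-in for a 4-bit Marlin-backed wrapper with shape constraints."""

def create_quant_linear(bits: int, group_size: int, in_features: int,
                        out_features: int, bias: bool = True) -> nn.Module:
    """Hypothetical factory an external quantizer such as AutoRound could call:
    given the quantization parameters and layer shape, return the best wrapper
    layer the available backends support. In a real implementation group_size
    and the rest of the quant config would be forwarded to the wrapper."""
    if bits == 4 and in_features % 128 == 0 and out_features % 256 == 0:
        return Marlin4BitLinear(in_features, out_features, bias=bias)
    return TorchQuantLinear(in_features, out_features, bias=bias)

layer = create_quant_linear(bits=4, group_size=128, in_features=4096, out_features=11008)
print(type(layer).__name__)  # Marlin4BitLinear
```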

By the way, if you’re Chinese, we could add each other on WeChat if you don’t mind.

Qubitium commented 3 months ago

@wenhuach21 Do you have X/Twitter? pm me at my x handle @qubitium and we can go from there. I believe our projects can work together to deduplicate the work and agree on some protocol for sharing data and such.

wenhuach21 commented 3 months ago

> @wenhuach21 Do you have X/Twitter? pm me at my x handle @Qubitium and we can go from there. I believe our projects can work together to deduplicate the work and agree on some protocol for sharing data and such.

I have an account, but I rarely use it, so we can stick with GitHub. We'll try to use your repository as the CUDA backend. If you have any design updates for mixed precision or need alignment on other aspects, feel free to let me know or submit a pull request to the autoround repository.

By the way, we're planning to explore Weight-Activation quantization and have already achieved promising results. It would be great if you could support the CUDA kernels for this in the future.
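For context on what that would involve: weight-activation quantization (e.g. W8A8) quantizes the activations at runtime in addition to the weights, so the matmul itself can run in int8. A toy simulated-quantization sketch, not AutoRound's actual implementation:

```python
import torch

def fake_quant_int8(x: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor int8 simulation: scale, round, clamp, then dequantize.
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    return (x / scale).round().clamp(-127, 127) * scale

w = torch.randn(4096, 4096)  # weight
a = torch.randn(8, 4096)     # activation batch
# W8A8: both the weight and the activation are quantized before the matmul;
# a real kernel would run the matmul in int8 and rescale the int32 accumulator.
y = fake_quant_int8(a) @ fake_quant_int8(w).T
print(y.shape)  # torch.Size([8, 4096])
```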