casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

[RFC] options about low-bit GEMM kernels contribution on x86 CPUs #390

Open zhewang1-intc opened 6 months ago

zhewang1-intc commented 6 months ago

Hi, this is Zhe from the Intel AI software engineering team. Thank you for creating this amazing project, AutoAWQ.

Motivation

My colleagues have done some pretty good work on low-bit GEMMs (e.g. int4). We have developed a kernel template library called BesTLA, which is similar to Cutlass. BesTLA is highly optimized for x86 CPU hardware and supports the most advanced ISAs (e.g. AMX, VNNI). It has shown significant performance benefits in recent MLPerf submissions.
We would like to help optimize the performance of AutoAWQ on x86 CPUs and would appreciate advice from the community on how best to contribute to this project.

Options

We can contribute these kernels in two different ways:

Option 1: Through a Python package

Currently, the BesTLA kernels are packaged as Torch extended ops in the Intel-Extension-for-Transformers (ITREX) Python package. AutoAWQ could use these ops directly by adding ITREX to the requirements list in its setup.py. Because ITREX is built against a specific Torch version, AutoAWQ can pin the ITREX version from PyPI so that the Torch version AutoAWQ depends on matches the one ITREX was built against.
If users need to build from source, we can also add a specific commit from the ITREX main branch to the AutoAWQ requirements, like:

requirements = [
    "torch>=2.0.1",
    # pin a specific ITREX commit (placeholder) for source builds
    "intel-extension-for-transformers @ git+https://github.com/intel/intel-extension-for-transformers.git@commit"
]
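
To illustrate how such an optional dependency might be consumed, here is a minimal sketch (not part of AutoAWQ or ITREX) of probing for the package at runtime and only selecting the CPU kernel path when it is installed. The helper names itrex_available and select_cpu_backend are hypothetical, and the concrete Torch extended op that the ITREX path would call is deliberately left out.

import importlib.util

def itrex_available() -> bool:
    """Return True if intel-extension-for-transformers (ITREX) is importable."""
    return importlib.util.find_spec("intel_extension_for_transformers") is not None

def select_cpu_backend() -> str:
    """Choose the CPU GEMM path: ITREX kernels when present, otherwise a
    fallback path (e.g. dequantize + matmul)."""
    return "itrex" if itrex_available() else "dequant_matmul"

print(select_cpu_backend())  # "itrex" when ITREX is installed, else "dequant_matmul"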

The main advantage of this approach is its simplicity: AutoAWQ only adds a Python dependency and does not need to change its own build process.

We welcome any suggestions. Feel free to comment so that we can find the most appropriate way to contribute. :)

casper-hansen commented 6 months ago

Hi @zhewang1-intc, thank you for your interest. It would be incredibly exciting to make a CPU-compatible kernel available for AutoAWQ. We already have a CPU-compatible approach (dequantizing + torch matmul), but it is so slow that I will not release it.

It seems Option 1 is the most feasible for integration into AutoAWQ, as Option 2 has much higher complexity due to the build process. To make this work, we need:

  1. New kernels that are compatible with the weight packing format used by WQLinear_GEMM or WQLinear_GEMVFast. The two use different formats; the GEMVFast kernels are newer, faster, and easier to read and understand.
  2. The kernels need to implement efficient dequantization, which is important for overall speed. The referenced kernel performs part of the dequantization and is followed by code that multiplies by the scales and subtracts the zeros.
  3. The process after dequantization is mostly FP16 accumulation and matrix multiplication (see the reference sketch below).
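
For concreteness, here is a minimal, unoptimized reference sketch of steps 2 and 3 in plain PyTorch, assuming the 4-bit weights have already been unpacked to integer tensors and using the common (q - zero) * scale convention with per-group scales and zeros. The real WQLinear_GEMM / WQLinear_GEMVFast layers pack eight 4-bit values per int32 with their own layouts, which an optimized kernel has to handle; the helper names dequantize_ref and awq_linear_ref are illustrative only.

import torch

def dequantize_ref(qweight, scales, zeros, group_size=128):
    """Reference dequantization: (q - zero) * scale, applied per group.
    qweight: [in_features, out_features] integers in [0, 15]
    scales, zeros: [in_features // group_size, out_features]"""
    scales = scales.repeat_interleave(group_size, dim=0)
    zeros = zeros.repeat_interleave(group_size, dim=0)
    return (qweight.to(scales.dtype) - zeros) * scales

def awq_linear_ref(x, qweight, scales, zeros, group_size=128):
    # Step 2: dequantize the weights; step 3: accumulate with a plain matmul.
    # (FP16 on GPU in practice; FP32 here so the toy example also runs on CPU.)
    return x @ dequantize_ref(qweight, scales, zeros, group_size)

# Toy shapes: in_features=256, out_features=64, two groups of 128.
qweight = torch.randint(0, 16, (256, 64))
scales = torch.rand(2, 64)
zeros = torch.randint(0, 16, (2, 64)).float()
x = torch.randn(1, 256)
print(awq_linear_ref(x, qweight, scales, zeros).shape)  # torch.Size([1, 64])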

I hope this provides a better understanding of the general implementation of quantized linear layers. I am excited to explore how we can leverage x86 kernels.

zhewang1-intc commented 6 months ago

@casper-hansen Thanks, very useful 😄

hshen14 commented 6 months ago

Thanks @casper-hansen. We briefly discussed adding CPU optimizations to AutoAWQ on X, and we are going to create a PR soon, as you suggested.