casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

[RFC] options about low-bit GEMM kernels contribution on x86 CPUs #390

Open zhewang1-intc opened 6 months ago

zhewang1-intc commented 6 months ago

Hi, this is Zhe from the Intel AI software engineering team. Thank you for creating this amazing project, AutoAWQ.

Motivation

My colleagues have done some pretty good work on low-bit GEMMs (e.g. int4). We have developed a kernel template library called BesTLA, which is similar to Cutlass. BesTLA is highly optimized for x86 CPU hardware and supports the most advanced ISAs (e.g. AMX, VNNI). It has shown significant performance benefits in recent MLPerf submissions.
We would like to help optimize the performance of AutoAWQ on x86 CPUs and would appreciate advice from the community on how best to contribute to this project.

Options

We can contribute these kernels in two different ways:

Option 1: Through a Python package

Currently, the BesTLA kernels are packaged as Torch extended ops in the Intel-Extension-for-Transformers (ITREX) Python package. AutoAWQ could use these ops directly by adding ITREX to the requirements list in its setup.py. Because ITREX is built against a specific Torch version, AutoAWQ can pin the ITREX version from PyPI so that the Torch version AutoAWQ depends on matches the one ITREX was built against.
If users need to build from source, we can also add a specific commit from the ITREX main branch to the AutoAWQ requirements, like:

requirements = [
    "torch>=2.0.1",
    # pin a specific ITREX commit (placeholder) for source builds
    "intel-extension-for-transformers @ git+https://github.com/intel/intel-extension-for-transformers.git@commit"
]
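
To illustrate how such an optional dependency might be consumed, here is a minimal sketch (not part of AutoAWQ or ITREX) of probing for the package at runtime and only selecting the CPU kernel path when it is installed. The helper names itrex_available and select_cpu_backend are hypothetical, and the concrete Torch extended op that the ITREX path would call is deliberately left out.

import importlib.util

def itrex_available() -> bool:
    """Return True if intel-extension-for-transformers (ITREX) is importable."""
    return importlib.util.find_spec("intel_extension_for_transformers") is not None

def select_cpu_backend() -> str:
    """Choose the CPU GEMM path: ITREX kernels when present, otherwise a
    fallback path (e.g. dequantize + matmul)."""
    return "itrex" if itrex_available() else "dequant_matmul"

print(select_cpu_backend())  # "itrex" when ITREX is installed, else "dequant_matmul"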

The main advantage of this approach is its simplicity: AutoAWQ only adds a Python dependency and does not need to change its own build process.

We welcome any suggestions. Feel free to comment so that we can find the most appropriate way to contribute. :)

casper-hansen commented 6 months ago

Hi @zhewang1-intc, thank you for your interest. It would be incredibly exciting to make a CPU-compatible kernel available for AutoAWQ. We already have a CPU-compatible approach (dequantizing + torch matmul), but it is so slow that I will not release it.

It seems Option 1 is the most feasible for integration into AutoAWQ, as Option 2 has much higher complexity due to the build process. To make this work, we need:

  1. New kernels that are compatible with the weight packing format used by WQLinear_GEMM or WQLinear_GEMVFast. The two use different formats; the GEMVFast kernels are newer, faster, and easier to read and understand.
  2. The kernels need to implement efficient dequantization, which is important for overall speed. The referenced kernel performs part of the dequantization and is followed by code that multiplies by the scales and subtracts the zeros.
  3. The process after dequantization is mostly FP16 accumulation and matrix multiplication (see the reference sketch below).
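
For concreteness, here is a minimal, unoptimized reference sketch of steps 2 and 3 in plain PyTorch, assuming the 4-bit weights have already been unpacked to integer tensors and using the common (q - zero) * scale convention with per-group scales and zeros. The real WQLinear_GEMM / WQLinear_GEMVFast layers pack eight 4-bit values per int32 with their own layouts, which an optimized kernel has to handle; the helper names dequantize_ref and awq_linear_ref are illustrative only.

import torch

def dequantize_ref(qweight, scales, zeros, group_size=128):
    """Reference dequantization: (q - zero) * scale, applied per group.
    qweight: [in_features, out_features] integers in [0, 15]
    scales, zeros: [in_features // group_size, out_features]"""
    scales = scales.repeat_interleave(group_size, dim=0)
    zeros = zeros.repeat_interleave(group_size, dim=0)
    return (qweight.to(scales.dtype) - zeros) * scales

def awq_linear_ref(x, qweight, scales, zeros, group_size=128):
    # Step 2: dequantize the weights; step 3: accumulate with a plain matmul.
    # (FP16 on GPU in practice; FP32 here so the toy example also runs on CPU.)
    return x @ dequantize_ref(qweight, scales, zeros, group_size)

# Toy shapes: in_features=256, out_features=64, two groups of 128.
qweight = torch.randint(0, 16, (256, 64))
scales = torch.rand(2, 64)
zeros = torch.randint(0, 16, (2, 64)).float()
x = torch.randn(1, 256)
print(awq_linear_ref(x, qweight, scales, zeros).shape)  # torch.Size([1, 64])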

I hope this provides a better understanding of the general implementation of quantized linear layers. I am excited to explore how we can leverage x86 kernels.

zhewang1-intc commented 6 months ago

@casper-hansen Thanks, very useful 😄

hshen14 commented 6 months ago

Thanks @casper-hansen. We briefly discussed adding CPU optimizations to AutoAWQ on X, and we are going to create a PR soon, as you suggested.