zhewang1-intc opened this issue 6 months ago
Hi @zhewang1-intc, thank you for your interest. It would be incredibly exciting to make a CPU-compatible kernel available for AutoAWQ. We already have a CPU-compatible approach (dequantizing + torch matmul), but it is so slow that I have not released it.
It seems Option 1 is the most feasible for integration into AutoAWQ, as Option 2 has much higher complexity due to the build process. To make this work, we need to support the quantized linear layers `WQLinear_GEMM` or `WQLinear_GEMVFast`. They use different weight formats; the GEMVFast kernels are newer, faster, and also easier to read and understand. I hope this provides a better understanding of the general implementation of quantized linear layers. I am excited to explore how we can leverage x86 kernels.
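For readers unfamiliar with what the slow "dequantize + torch matmul" fallback looks like, here is a minimal sketch. The weight layout below (one unpacked 4-bit code per element, with per-group scales and zero points) is an illustrative assumption, not AutoAWQ's actual packed `WQLinear_GEMM` format; NumPy is used to keep the example framework-agnostic.

```python
import numpy as np

def dequant_matmul(x, qweight, scales, zeros, group_size=128):
    # qweight: [in_features, out_features], each element a 4-bit code (0..15)
    # scales, zeros: [in_features // group_size, out_features], one row per group
    # NOTE: illustrative layout only; real AWQ kernels keep weights packed.
    s = np.repeat(scales, group_size, axis=0)   # broadcast group params to rows
    z = np.repeat(zeros, group_size, axis=0)
    w = (qweight.astype(np.float32) - z) * s    # dequantize group-wise
    return x @ w                                # plain dense matmul
```

The point of the sketch is to show why this path is slow: the full fp32 weight matrix is materialized on every call, which is exactly the overhead that fused low-bit kernels like BesTLA avoid.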
@casper-hansen Thanks, very useful!
Thanks @casper-hansen. We briefly discussed this on X about adding CPU optimizations to AutoAWQ, and we are going to create a PR soon as you suggested.
Hi, this is Zhe from the Intel AI software engineering team. Thank you for creating this amazing project, AutoAWQ.
Motivation
My colleagues have done some very good work on low-bit GEMMs (e.g. int4). We have developed a kernel template library called BesTLA, which is similar to CUTLASS. BesTLA is highly optimized for x86 CPU hardware and supports the most advanced ISAs (e.g. AMX, VNNI). It has shown significant performance benefits in recent MLPerf submissions.
We would like to help optimize the performance of AutoAWQ on x86 CPUs. We need some advice from the community on how to best contribute to this project.
Options
We can contribute these kernels in two different ways:
Option 1: Through a Python package
Currently, BesTLA kernels are packaged as Torch extension ops in the Intel-Extension-for-Transformers (ITREX) Python package. AutoAWQ can use these Torch extension ops directly by adding ITREX to the requirements list in its setup.py file. ITREX depends on the Torch version at build time, so AutoAWQ can pin the ITREX version on PyPI to ensure that the Torch version it depends on matches the Torch version ITREX was built against.
If users need to build from source, we can also pin a specific commit from the ITREX repo's main branch in AutoAWQ's requirements.
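For illustration, such a commit-pinned requirement might look like the following. The `<commit-sha>` is a placeholder, not a real pin, and the exact repository URL should be confirmed against the ITREX project:

```
# requirements sketch (hypothetical; <commit-sha> is a placeholder)
intel-extension-for-transformers @ git+https://github.com/intel/intel-extension-for-transformers.git@<commit-sha>
```

Pinning a commit this way gives reproducible source builds while still letting the kernels live entirely in the ITREX repo.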
The advantages of this approach are:
It has minimal impact on the existing AutoAWQ project structure.
It makes it easier to maintain and update the CPU-side kernels. You can update the kernels by simply changing the ITREX version in setup.py.
Option 2: Through source code
We can also integrate BesTLA directly into the AutoAWQ_kernels project. However, building BesTLA into Torch extension ops is quite complex: it requires CMake to build the extension, so we would need to rewrite setuptools' `build_ext` class to be compatible with both Torch's `BuildExtension` class and a `CMakeBuild` class. This would add a lot of code and require major changes to AutoAWQ_kernels' setup.py.

The advantages of this approach are:
It does not introduce third-party dependencies.
AutoAWQ maintainers can have better control over their project.
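To give a rough sense of the build complexity Option 2 implies, here is a minimal sketch of a custom `build_ext` that delegates a CMake-based extension to CMake while leaving ordinary extensions to the default path. All names and CMake flags are illustrative assumptions; a real integration would additionally have to interoperate with `torch.utils.cpp_extension.BuildExtension` for the existing CUDA/C++ ops.

```python
import os
import subprocess
from setuptools import Extension
from setuptools.command.build_ext import build_ext


class CMakeExtension(Extension):
    """Marker extension whose sources are compiled by CMake, not setuptools."""
    def __init__(self, name, sourcedir="."):
        super().__init__(name, sources=[])  # empty: CMake handles compilation
        self.sourcedir = os.path.abspath(sourcedir)


class CMakeBuild(build_ext):
    """Hypothetical bridge: run CMake for CMakeExtension, default otherwise."""
    def build_extension(self, ext):
        if not isinstance(ext, CMakeExtension):
            super().build_extension(ext)  # normal setuptools/Torch path
            return
        extdir = os.path.dirname(os.path.abspath(self.get_ext_fullpath(ext.name)))
        os.makedirs(self.build_temp, exist_ok=True)
        subprocess.check_call(
            ["cmake", ext.sourcedir,
             f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={extdir}"],
            cwd=self.build_temp,
        )
        subprocess.check_call(["cmake", "--build", "."], cwd=self.build_temp)
```

Even this stripped-down version shows why Option 2 touches setup.py heavily: the build command must dispatch per-extension and manage CMake's configure/build cycle itself.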
We welcome any suggestions. Feel free to comment so that we can find the most appropriate way to contribute. :)