HanGuo97 / flute

Fast Matrix Multiplications for Lookup Table-Quantized LLMs
https://arxiv.org/abs/2407.10960
Apache License 2.0

Implementation and performance on CPUs #4

Closed vineel96 closed 3 months ago

vineel96 commented 3 months ago

Hello @HanGuo97, does this fast matrix multiplication kernel also work on CPUs? Have you done any experiments on CPUs?

HanGuo97 commented 3 months ago

Hi, thanks for the question!

Unfortunately, we do not have CPU support.

vineel96 commented 3 months ago

Thanks for the reply @HanGuo97. Is there any plan/possibility to extend this work to CPUs? Also, any references/links in this regard would be helpful. Thanks.

HanGuo97 commented 3 months ago

FLUTE makes a few assumptions about the hardware platform, and because of this it is designed specifically for NVIDIA GPUs. We do have near-term plans to extend FLUTE to additional Ampere/Ada-generation GPUs, but CPUs are, unfortunately, not in our plans at the moment. We are happy to help out if you are interested in extending it to CPUs, though.

As for references, what kind of references are you looking for?

vineel96 commented 3 months ago

@HanGuo97,

  1. Is it feasible to extend this work to CPUs, specifically ARM? From your expertise, what would the challenges be, and can we expect the same performance boost? Since you have compared against torch.mm and BitBLAS on GPUs: on CPUs there is an optimized JIT'ed library known as oneDNN, which uses BRGEMM (batch-reduce GEMM), a highly optimized matmul. Is there any chance you have compared FLUTE against the BRGEMM algorithm? (A rough sketch of the BRGEMM contraction is included after this list.)
  2. The references I am looking for concern optimized execution of matrix multiplication on CPUs to speed up LLMs.
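
For context, BRGEMM computes roughly the contraction `C += sum_i A_i @ B_i`. A minimal NumPy sketch of the math is below; it is not oneDNN's actual API, and oneDNN implements this with blocking and hardware-specific intrinsics:

```python
import numpy as np

# Batch-reduce GEMM: accumulate a batch of small GEMMs into one output block,
# so a single accumulator C stays hot in registers/cache across the batch.
# Illustrative only; not oneDNN's API.
def brgemm(a_blocks, b_blocks, c):
    for a, b in zip(a_blocks, b_blocks):
        c += a @ b
    return c

batch, m, k, n = 4, 8, 16, 8
a_blocks = [np.random.randn(m, k) for _ in range(batch)]
b_blocks = [np.random.randn(k, n) for _ in range(batch)]
c = brgemm(a_blocks, b_blocks, np.zeros((m, n)))
np.testing.assert_allclose(c, sum(a @ b for a, b in zip(a_blocks, b_blocks)))
```
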
HanGuo97 commented 3 months ago

I think some of the "ideas" used in FLUTE might be useful for CPUs. For example, offline partitioning to reduce runtime re-ordering before the hardware-accelerated matmul intrinsics, although I'm not familiar with what kinds of intrinsics are available on CPUs. I have very little high-performance CPU programming experience, so it's a bit hard for me to judge.
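
To illustrate the offline re-ordering idea, here is a minimal NumPy sketch. The 4x4 tile shape and permutation are made up for illustration and are not FLUTE's actual Tensor Core layout; the point is only that the weights are permuted once, ahead of time, into the order the kernel consumes them:

```python
import numpy as np

# Toy illustration of offline weight re-ordering: permute the (already
# quantized) weight matrix once, ahead of time, into the access order the
# matmul kernel consumes, so no shuffling is needed at inference time.
# The 4x4 tiling below is hypothetical, NOT FLUTE's actual layout.

TILE = 4

def pack_offline(w: np.ndarray) -> np.ndarray:
    """Reorder an (M, K) weight matrix into contiguous TILE x TILE tiles."""
    M, K = w.shape
    assert M % TILE == 0 and K % TILE == 0
    # (M, K) -> (M/T, T, K/T, T) -> (M/T, K/T, T, T): each tile becomes
    # contiguous, stored in the order the toy kernel below reads it.
    return w.reshape(M // TILE, TILE, K // TILE, TILE).transpose(0, 2, 1, 3).copy()

def matmul_packed(x: np.ndarray, w_packed: np.ndarray) -> np.ndarray:
    """Toy kernel: walk the pre-packed tiles sequentially, no runtime shuffle."""
    n_mt, n_kt, _, _ = w_packed.shape
    out = np.zeros((n_mt * TILE, x.shape[1]))
    for mt in range(n_mt):
        for kt in range(n_kt):
            tile = w_packed[mt, kt]                      # contiguous read
            out[mt * TILE:(mt + 1) * TILE] += tile @ x[kt * TILE:(kt + 1) * TILE]
    return out

rng = np.random.default_rng(0)
w = rng.integers(0, 16, size=(8, 8)).astype(float)       # pretend-quantized weights
x = rng.standard_normal((8, 3))
np.testing.assert_allclose(matmul_packed(x, pack_offline(w)), w @ x)
```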

The line of work on tensor compilers (also referred to as ML compilation) could be useful. For example, this is a very good reference: https://mlc.ai/

vineel96 commented 3 months ago

@HanGuo97 Thanks for the insights and links; I will get back to you if I have any further doubts.

  1. Also, have you compared the speedups for Hugging Face models like BERT, TXL, and ViT?
  2. In the speedup graph against torch.mm, Marlin is at the top in many cases. Is there any specific reason why FLUTE could not beat Marlin?
  3. In the paper, you mention that the matrix should adhere to a layout specification. Can you specify what exactly the layout specification is for which the data reordering is done? Also, what exactly is weight data reordering?
HanGuo97 commented 3 months ago

Great to hear!

  1. FLUTE supports a somewhat limited set of models at the moment. This is not a hard limitation, and we are actively working on integrating FLUTE with a wider array of models.
  2. This is a great question; here's a somewhat technical answer. FLUTE is, loosely speaking, more general than Marlin, since integer quantization is a special case of LUT quantization with a particular choice of LUT. That generality comes at a cost: while Marlin can use more advanced optimizations through bit-level manipulation entirely in registers, FLUTE needs to read the lookup table from shared memory. This introduces extra memory accesses, which are expensive in memory-bound settings like LLM inference. As such, we consider Marlin an upper bound for FLUTE. (A toy sketch of the LUT-vs-integer relationship is at the end of this comment.)
  3. Unfortunately, the Tensor Core layout is somewhat hard to read. I have attached examples for Volta-generation GPUs below if you are interested.

cutlass-mma-layout.pdf
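
To make point 2 a bit more concrete, here is a toy NumPy sketch (not FLUTE's kernel) of why integer quantization is a special case of LUT quantization: with a uniform table, the lookup reproduces ordinary integer dequantization, while an arbitrary table gives the general LUT case; on the GPU that table lives in shared memory, which is the extra memory traffic mentioned above.

```python
import numpy as np

# Toy sketch (not FLUTE's kernel): dequantization as a table lookup.
rng = np.random.default_rng(0)
bits = 4
codes = rng.integers(0, 2 ** bits, size=(8, 16))           # quantized weight codes
scale, zero_point = 0.05, 8                                 # made-up quant parameters

# Uniform integer dequantization: w = scale * (q - zero_point) ...
w_int = scale * (codes - zero_point)

# ... is exactly a lookup into a particular (uniform) table:
uniform_lut = scale * (np.arange(2 ** bits) - zero_point)
np.testing.assert_allclose(uniform_lut[codes], w_int)

# LUT quantization allows an arbitrary (non-uniform) table, e.g. NF4-style
# values; a gather like this is what the GPU kernel serves from shared memory.
nonuniform_lut = np.sort(rng.standard_normal(2 ** bits))
w_lut = nonuniform_lut[codes]
```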