NolanoOrg / cformers

SoTA Transformers with C-backend for fast inference on your CPU.
MIT License

Add C functions for MatMul over Int-3 Quant and Int-4 with different bin-sizes #12

Open Ayushk4 opened 1 year ago

Ayushk4 commented 1 year ago

Both int3 and int4 with bin sizes greater than 32 would need empirical validation; the existing Int-4 LLaMa study is not enough, as it only considered quantization of the weight matrices, not of the intermediate representations.
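For context, "bin size" here means how many consecutive weights share one quantization scale. The sketch below is a minimal, hypothetical illustration of bin-wise int4 quantization of one weight row; the function name, scale layout, and nibble packing are assumptions for illustration, not cformers' actual code.

```c
/* Hypothetical sketch: quantize one weight row to int4 with a shared
 * float scale per bin of `bin_size` consecutive weights.
 * Names and packing layout are illustrative only. */
#include <math.h>
#include <stdint.h>

void quantize_row_q4(const float *w, uint8_t *q, float *scales,
                     int n, int bin_size) {
    for (int b = 0; b < n / bin_size; b++) {
        /* find the max absolute value in this bin */
        float amax = 0.0f;
        for (int i = 0; i < bin_size; i++) {
            float v = fabsf(w[b * bin_size + i]);
            if (v > amax) amax = v;
        }
        float d = amax / 7.0f;                    /* int4 range: -7 .. 7 */
        scales[b] = d;
        float id = d != 0.0f ? 1.0f / d : 0.0f;
        /* pack two 4-bit values per byte, offset by 8 to store unsigned */
        for (int i = 0; i < bin_size; i += 2) {
            int v0 = (int)roundf(w[b * bin_size + i]     * id) + 8;
            int v1 = (int)roundf(w[b * bin_size + i + 1] * id) + 8;
            q[(b * bin_size + i) / 2] = (uint8_t)(v0 | (v1 << 4));
        }
    }
}
```

A larger bin size amortizes the per-bin scale over more weights (fewer overhead bits per weight) but makes the quantization coarser, which is exactly why empirical validation is needed.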

Quality may well worsen when the intermediate representations are also quantized to the same extent (i.e., with the same larger bin size) as the weight matrix. If that happens, intermediate representations will have to be quantized differently from the weights, and new MatMul kernels will have to be added, as sketched below.
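Below is a minimal, hypothetical sketch of the inner dot-product loop for the weight-only-quantization case: int4 weights are dequantized on the fly against float activations. If the activations were quantized too, the accumulation and scaling would change and a different kernel would be required. The function name and layout mirror the sketch above and are assumptions, not the actual cformers kernels.

```c
/* Hypothetical sketch: dot product of one q4 weight row against a float
 * activation vector, dequantizing per bin. Quantized activations would
 * need a separate scale per activation bin and integer accumulation,
 * i.e. a different kernel. */
#include <stdint.h>

float dot_q4_f32(const uint8_t *q, const float *scales,
                 const float *x, int n, int bin_size) {
    float sum = 0.0f;
    for (int b = 0; b < n / bin_size; b++) {
        const float d = scales[b];
        float bsum = 0.0f;
        for (int i = 0; i += 2, i < bin_size; ) { /* see note below */ }
        for (int i = 0; i < bin_size; i += 2) {
            const uint8_t byte = q[(b * bin_size + i) / 2];
            const int v0 = (byte & 0x0F) - 8;     /* low nibble  */
            const int v1 = (byte >> 4)   - 8;     /* high nibble */
            bsum += v0 * x[b * bin_size + i] + v1 * x[b * bin_size + i + 1];
        }
        sum += d * bsum;
    }
    return sum;
}
```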

Originally suggested by @MarkSchmidty in https://github.com/NolanoOrg/cformers/issues/2#issuecomment-1475648776