NolanoOrg / cformers

SoTA Transformers with C-backend for fast inference on your CPU.

Upload GPTQ Quantized models in 4-bit precision format for different bin-sizes to Huggingface #2

Open Ayushk4 opened 1 year ago

MarkSchmidty commented 1 year ago

Per the *Int-4 LLaMa is not enough - Int-3 and beyond* study, binning with a bin size of 128 appears to remove most of the remaining output-quality loss of GPTQ for models larger than ~10B, while only negligibly affecting the memory requirement.
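
For reference, here is a minimal sketch of what bin-size-128 ("group-wise") quantization storage looks like and where the small memory overhead comes from. The function name and the fp16 scale/zero-point layout are illustrative assumptions, not cformers or GPTQ-for-LLaMa code, and GPTQ itself picks the quantized values with a Hessian-based update rather than plain rounding:

```python
import numpy as np

def quantize_groupwise(w_row, bits=3, group_size=128):
    """Quantize one weight row in groups ("bins") of `group_size`,
    keeping one fp16 scale and one fp16 zero-point per group.

    Sketch only; assumes len(w_row) is a multiple of group_size.
    """
    qmax = 2**bits - 1
    groups = w_row.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / qmax, 1e-8)  # one scale per group
    zero = np.round(-w_min / scale)                   # one zero-point per group
    q = np.clip(np.round(groups / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale.astype(np.float16), zero.astype(np.float16)

# Metadata overhead per weight: two fp16 values per 128 weights
#   = 32 bits / 128 weights = 0.25 extra bits per weight,
# so "3-bit with bin size 128" is roughly 3.25 bits/weight in memory.
```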

GPTQ-for-LLaMa, one of the first GPTQ projects, is already moving towards 3-bit, with binning as the new default.

Given that memory bandwidth is the major bottleneck on CPU, fewer bits mean faster inference. For models that are large enough (~10B+), 3-bit GPTQ with binning may be the way to go.
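
Rough arithmetic behind that claim (a sketch with assumed numbers; the ~13B parameter count and 50 GB/s bandwidth figure are illustrative, not measurements): generating one token streams essentially the whole weight matrix once, so the best-case throughput is memory bandwidth divided by the quantized model size.

```python
def tokens_per_sec_upper_bound(n_params, bits_per_weight, bandwidth_gb_s):
    """Crude upper bound: each generated token reads every weight once,
    so throughput <= memory bandwidth / quantized model size.
    Ignores the KV cache, activations, and compute cost."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# Hypothetical numbers: ~13B parameters, ~50 GB/s of CPU memory bandwidth.
for bits in (16, 4.25, 3.25):  # fp16, 4-bit + bin-128 overhead, 3-bit + bin-128 overhead
    print(f"{bits:5} bits/weight -> <= {tokens_per_sec_upper_bound(13e9, bits, 50):.1f} tok/s")
```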

Ayushk4 commented 1 year ago

Thanks for the suggestion, @MarkSchmidty. I am opening a separate issue (#12), as this will also require new C/C++ kernels.

In short, the *Int-4 LLaMa is not enough* study assumed that only the weights were quantized, not the intermediate representations. We need to either add new kernels or run another study of the performance drop.
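
What that implies for the kernels, as a hedged sketch (written in Python for readability; the layout and function name are illustrative and reuse the hypothetical group-wise format above, not cformers code): the packed weights are dequantized on the fly inside the matrix-vector product, while the activations and accumulators stay in fp32, which is the weight-only setting the study assumed.

```python
import numpy as np

def matvec_groupwise(q, scales, zeros, x):
    """y = W @ x with W stored group-wise quantized and x an fp32 activation vector.

    q:      uint8 codes, shape [n_rows, n_groups, group_size]
    scales: fp16 scales, shape [n_rows, n_groups, 1]
    zeros:  fp16 zeros,  shape [n_rows, n_groups, 1]
    x:      fp32 vector, length n_groups * group_size

    Sketch only: a real C/C++ kernel (issue #12) would keep the codes
    bit-packed and dequantize a few weights at a time in registers,
    but the point stands: only the weights are quantized here.
    """
    n_rows = q.shape[0]
    y = np.empty(n_rows, dtype=np.float32)
    for i in range(n_rows):
        # Dequantize row i group by group, then accumulate in fp32.
        w_row = (q[i].astype(np.float32) - zeros[i].astype(np.float32)) * scales[i].astype(np.float32)
        y[i] = np.dot(w_row.reshape(-1), x)
    return y
```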