NolanoOrg / cformers

SoTA Transformers with C-backend for fast inference on your CPU.
MIT License

Saving Keys and Values Cache at lower precision #6

Status: Open · Ayushk4 opened this issue 1 year ago

Ayushk4 commented 1 year ago

Refer to https://github.com/FMInference/FlexGen - they have explored storing the attention key/value (KV) cache with 4-bit quantization.
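As a rough sketch of the idea, the KV cache can be quantized group-wise to 4 bits: each group of values shares a min and a scale, and two 4-bit codes are packed per byte. This is an illustrative NumPy sketch, not the FlexGen or cformers implementation; the group size of 64 and the min/scale packing scheme are assumptions for the example.

```python
import numpy as np

GROUP_SIZE = 64  # assumed group size for this sketch

def quantize_4bit(x: np.ndarray):
    """Quantize a float cache tensor to 4-bit codes, group-wise.

    Returns packed uint8 codes (two 4-bit values per byte) plus the
    per-group scale and minimum needed to dequantize.
    """
    flat = x.astype(np.float32).reshape(-1, GROUP_SIZE)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0           # 4 bits -> 16 levels (0..15)
    scale[scale == 0] = 1.0            # avoid division by zero for flat groups
    q = np.clip(np.round((flat - lo) / scale), 0, 15).astype(np.uint8)
    packed = (q[:, 0::2] << 4) | q[:, 1::2]   # two nibbles per byte
    return packed, scale, lo

def dequantize_4bit(packed, scale, lo, shape):
    """Reconstruct an approximate float tensor from packed 4-bit codes."""
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    q[:, 0::2] = packed >> 4
    q[:, 1::2] = packed & 0x0F
    return (q * scale + lo).reshape(shape).astype(np.float32)

# Example: a toy (seq_len, num_heads, head_dim) key cache
kv = np.random.randn(8, 4, GROUP_SIZE).astype(np.float32)
packed, scale, lo = quantize_4bit(kv)
restored = dequantize_4bit(packed, scale, lo, kv.shape)
print("bytes fp32:", kv.nbytes, "bytes 4-bit codes:", packed.nbytes)
print("max abs error:", np.abs(kv - restored).max())
```

The packed codes take 8x less memory than fp32 (plus a small per-group overhead for the scale and minimum), and the reconstruction error is bounded by half a quantization step per group.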