ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Implementation of NoMAD-Attention: Enhancing LLM Inference Efficiency on CPUs #7532

Closed niranjanakella closed 2 months ago

niranjanakella commented 4 months ago

Hey @ggerganov,

I highly appreciate the fundamental work that you have put into this project. I would like to know whether llama.cpp has (or plans to have) a SIMD-based implementation similar to the new NoMAD-Attention approach.
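
To make the question concrete: my rough understanding of NoMAD-Attention is that key sub-vectors are quantized to small codes offline, and at inference time the query-key dot products are replaced by in-register table lookups of precomputed partial scores (similar in spirit to FAISS-style fast-scan ADC). The sketch below is only my own hypothetical illustration of that lookup step, not actual NoMAD or llama.cpp code; the function and data layout are made up for illustration.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper (not llama.cpp / NoMAD code): replaces 16
 * multiply-adds with a single in-register table lookup (SSSE3 pshufb).
 *   codes : 16 key codes for one sub-space, one 4-bit code per byte
 *   lut   : 16 precomputed int8 "query . centroid" partial scores
 *   out   : running int16 attention scores for 16 keys              */
static inline void lookup_partial_scores(const uint8_t codes[16],
                                         const int8_t  lut[16],
                                         int16_t       out[16]) {
    __m128i c = _mm_loadu_si128((const __m128i *) codes);
    __m128i t = _mm_loadu_si128((const __m128i *) lut);

    c = _mm_and_si128(c, _mm_set1_epi8(0x0F));      /* keep the low nibble      */
    __m128i v = _mm_shuffle_epi8(t, c);             /* 16 table lookups at once */

    /* widen the int8 partial scores to int16 and accumulate */
    __m128i v_lo = _mm_cvtepi8_epi16(v);                     /* keys 0..7  */
    __m128i v_hi = _mm_cvtepi8_epi16(_mm_srli_si128(v, 8));  /* keys 8..15 */

    __m128i a_lo = _mm_loadu_si128((const __m128i *)(out));
    __m128i a_hi = _mm_loadu_si128((const __m128i *)(out + 8));
    _mm_storeu_si128((__m128i *)(out),     _mm_add_epi16(a_lo, v_lo));
    _mm_storeu_si128((__m128i *)(out + 8), _mm_add_epi16(a_hi, v_hi));
}
```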

In my tests, the response time of quantized models is noticeably slower than that of the half/full-precision versions. It would be great to understand whether llama.cpp supports AVX2 SIMD instructions for much faster CPU inference of SLMs (Small Language Models, <1B parameters).
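
To illustrate the kind of kernel I have in mind, here is a plain AVX2 + FMA fp32 dot product. This is just my own sketch of the general SIMD technique, not ggml's actual code (the real kernels operate on quantized blocks and are more involved), and the function name is made up.

```c
#include <immintrin.h>
#include <stddef.h>

/* Illustrative AVX2 + FMA dot product: 8 float lanes per iteration. */
float dot_f32_avx2(const float *x, const float *y, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        acc = _mm256_fmadd_ps(vx, vy, acc);   /* acc += x * y, 8 lanes */
    }
    /* horizontal reduction of the 8 partial sums */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);
    for (; i < n; ++i) sum += x[i] * y[i];    /* scalar tail */
    return sum;
}
```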

There is a huge push toward such models and on-device inference, so it would be great to learn more about this.

I highly appreciate the efforts put into this repo.

Best Regards, Niranjan Akella

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.