ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Implementation of NoMAD-Attention: Enhancing LLM Inference Efficiency on CPUs #7532

Closed niranjanakella closed 2 months ago

niranjanakella commented 4 months ago

Hey @ggerganov,

I highly appreciate the fundamental work that you have put into this project. I would like to know whether llama.cpp has (or plans to have) a SIMD-based implementation similar to the new NoMAD-Attention approach.
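
To make the question concrete: my rough understanding of NoMAD-Attention is that key sub-vectors are quantized to small codes offline, and at inference time the query-key dot products are replaced by in-register table lookups of precomputed partial scores (similar in spirit to FAISS-style fast-scan ADC). The sketch below is only my own hypothetical illustration of that lookup step, not actual NoMAD or llama.cpp code; the function and data layout are made up for illustration.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper (not llama.cpp / NoMAD code): replaces 16
 * multiply-adds with a single in-register table lookup (SSSE3 pshufb).
 *   codes : 16 key codes for one sub-space, one 4-bit code per byte
 *   lut   : 16 precomputed int8 "query . centroid" partial scores
 *   out   : running int16 attention scores for 16 keys              */
static inline void lookup_partial_scores(const uint8_t codes[16],
                                         const int8_t  lut[16],
                                         int16_t       out[16]) {
    __m128i c = _mm_loadu_si128((const __m128i *) codes);
    __m128i t = _mm_loadu_si128((const __m128i *) lut);

    c = _mm_and_si128(c, _mm_set1_epi8(0x0F));      /* keep the low nibble      */
    __m128i v = _mm_shuffle_epi8(t, c);             /* 16 table lookups at once */

    /* widen the int8 partial scores to int16 and accumulate */
    __m128i v_lo = _mm_cvtepi8_epi16(v);                     /* keys 0..7  */
    __m128i v_hi = _mm_cvtepi8_epi16(_mm_srli_si128(v, 8));  /* keys 8..15 */

    __m128i a_lo = _mm_loadu_si128((const __m128i *)(out));
    __m128i a_hi = _mm_loadu_si128((const __m128i *)(out + 8));
    _mm_storeu_si128((__m128i *)(out),     _mm_add_epi16(a_lo, v_lo));
    _mm_storeu_si128((__m128i *)(out + 8), _mm_add_epi16(a_hi, v_hi));
}
```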

In my tests, the response time of quantized models is noticeably slower than that of the half/full-precision versions. It would be great to understand whether llama.cpp supports AVX2 SIMD instructions for much faster CPU inference of SLMs (Small Language Models, <1B parameters).
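
To illustrate the kind of kernel I have in mind, here is a plain AVX2 + FMA fp32 dot product. This is just my own sketch of the general SIMD technique, not ggml's actual code (the real kernels operate on quantized blocks and are more involved), and the function name is made up.

```c
#include <immintrin.h>
#include <stddef.h>

/* Illustrative AVX2 + FMA dot product: 8 float lanes per iteration. */
float dot_f32_avx2(const float *x, const float *y, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        acc = _mm256_fmadd_ps(vx, vy, acc);   /* acc += x * y, 8 lanes */
    }
    /* horizontal reduction of the 8 partial sums */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);
    for (; i < n; ++i) sum += x[i] * y[i];    /* scalar tail */
    return sum;
}
```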

There is a huge push toward such models and on-device inference, so it would be great to learn more about this.

I highly appreciate the efforts put into this repo.

Best Regards, Niranjan Akella

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.