EricLBuehler / mistral.rs

Blazingly fast LLM inference.

Add kernel support for AArch64-specific GGUF files, i.e. Q4_0_*_* #799

Open · smpurkis opened this issue 1 month ago

smpurkis commented 1 month ago

Hello,

llama.cpp recently added support for AArch64-specific GGUF types and the matching AArch64-specific matmul kernels. Here is the merged PR: https://github.com/ggerganov/llama.cpp/pull/5780#pullrequestreview-21657544660

Namely, the Q4_0_8_8 and Q4_0_4_8 formats, plus the more generic Q4_0_4_4 GGUF model format.
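
Roughly, the idea behind these formats: Q4_0 stores 32 weights per block with one f16 scale, and the new variants repack the blocks of several consecutive rows into a single interleaved struct so the AArch64 GEMV/GEMM kernels can read all rows' scales and quants with contiguous loads. Here is a minimal Rust sketch, assuming the standard Q4_0 layout; the struct names and the exact interleave pattern are illustrative, not llama.cpp's actual layout:

```rust
#![allow(non_camel_case_types, dead_code)]

// Standard GGUF Q4_0 block: 32 weights, one f16 scale, 16 bytes of
// packed 4-bit quants (two weights per byte).
#[repr(C)]
struct BlockQ4_0 {
    d: u16,       // scale, stored as IEEE f16 bits
    qs: [u8; 16], // 32 x 4-bit quantized weights
}

// Interleaved variant in the spirit of llama.cpp's block_q4_0x4: the
// Q4_0 blocks of four consecutive rows are fused into one struct so a
// GEMM kernel can load all four rows' scales and quants with contiguous
// reads instead of four strided ones.
#[repr(C)]
struct BlockQ4_0x4 {
    d: [u16; 4],  // one scale per source row
    qs: [u8; 64], // the four rows' quants, interleaved in 4-byte groups
}

// Illustrative repacking; the real interleave pattern differs per
// variant (Q4_0_4_4 / Q4_0_4_8 / Q4_0_8_8) and per kernel.
fn repack_x4(rows: [&BlockQ4_0; 4]) -> BlockQ4_0x4 {
    let mut out = BlockQ4_0x4 { d: [0; 4], qs: [0; 64] };
    for (r, blk) in rows.iter().enumerate() {
        out.d[r] = blk.d;
        for (i, group) in blk.qs.chunks(4).enumerate() {
            // 4-byte group i of row r lands at interleaved slot i*4 + r
            let off = (i * 4 + r) * 4;
            out.qs[off..off + 4].copy_from_slice(group);
        }
    }
    out
}

fn main() {
    let b = BlockQ4_0 { d: 0x3C00, qs: [0x11; 16] }; // d = 1.0 in f16 bits
    let packed = repack_x4([&b, &b, &b, &b]);
    assert_eq!(packed.d, [0x3C00; 4]);
    assert_eq!(&packed.qs[..4], &[0x11; 4]);
}
```

Per the llama.cpp PR, Q4_0_4_4 targets plain NEON with dotprod, Q4_0_4_8 needs the i8mm extension, and Q4_0_8_8 targets SVE.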

EricLBuehler commented 1 week ago

@smpurkis thanks for the reference. I'm taking a look; this is on the radar.

smpurkis commented 1 week ago

@EricLBuehler I looked through the code and saw that Candle is used for quantized tensors, so I've started working on adding the data type to Candle: https://github.com/huggingface/candle/issues/2605

Could I get some guidance on whether that is the right place to add it?
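
If Candle is the right place, the change would presumably center on candle-core's quantized `GgmlDType` enum. A minimal sketch, assuming that enum and its `block_size`/`type_size`/`from_u32` methods; the new variant names and the ggml type ids are assumptions taken from the enum ordering in llama.cpp's ggml.h and should be verified, not treated as confirmed Candle API:

```rust
#![allow(non_camel_case_types)]

// Sketch of extending candle-core's quantized GgmlDType enum. Existing
// variants are elided; the three new variants and their ggml type ids
// (31..=33, assumed from llama.cpp's ggml.h ordering) need verifying.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GgmlDType {
    F32,
    F16,
    Q4_0,
    // ... existing quantized variants elided ...
    Q4_0_4_4,
    Q4_0_4_8,
    Q4_0_8_8,
}

impl GgmlDType {
    // All three new types repack Q4_0 data, so the average storage cost
    // is unchanged: 32 weights per block, 18 bytes per block
    // (2-byte f16 scale + 16 bytes of packed 4-bit quants).
    pub fn block_size(&self) -> usize {
        match self {
            Self::Q4_0 | Self::Q4_0_4_4 | Self::Q4_0_4_8 | Self::Q4_0_8_8 => 32,
            _ => todo!("existing variants"),
        }
    }

    pub fn type_size(&self) -> usize {
        match self {
            Self::Q4_0 | Self::Q4_0_4_4 | Self::Q4_0_4_8 | Self::Q4_0_8_8 => 18,
            _ => todo!("existing variants"),
        }
    }

    // Mapping from the GGUF tensor-type field when loading a file.
    pub fn from_u32(v: u32) -> Option<Self> {
        match v {
            31 => Some(Self::Q4_0_4_4),
            32 => Some(Self::Q4_0_4_8),
            33 => Some(Self::Q4_0_8_8),
            _ => None, // existing mappings elided
        }
    }
}

fn main() {
    assert_eq!(GgmlDType::from_u32(31), Some(GgmlDType::Q4_0_4_4));
    assert_eq!(GgmlDType::Q4_0_4_4.block_size(), 32);
}
```

The repacked GEMV/GEMM kernels themselves would still need to be ported or wrapped separately; an enum change like this only teaches the loader about the new layouts.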