i-quants offer better quantization quality than k-quants in the 2- and 3-bpw range, but are notoriously slow on the CPU. This PR brings a significant speedup on Arm CPU's, particularly for prompt processing. Performance is still lower than k-quants, but the performance gap is now substantially smaller.
The following table compares performance between the main branch and this PR for a 7B LLaMA model on an M2 Max CPU.
i-quants offer better quantization quality than k-quants in the 2- and 3-bpw range, but are notoriously slow on the CPU. This PR brings a significant speedup on Arm CPU's, particularly for prompt processing. Performance is still lower than k-quants, but the performance gap is now substantially smaller.
The following table compares performance between the main branch and this PR for a 7B LLaMA model on an M2 Max CPU.