Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Optimized matrix multiplications for i-quants on __aarch64__ #464

Closed ikawrakow closed 3 weeks ago

ikawrakow commented 3 weeks ago

i-quants offer better quantization quality than k-quants in the 2- and 3-bpw range, but are notoriously slow on the CPU. This PR brings a significant speedup on Arm CPUs, particularly for prompt processing. Performance is still lower than k-quants, but the gap is now substantially smaller.

The following table compares performance between the main branch and this PR for a 7B LLaMA model on an M2 Max CPU.

| cpu_info | model_filename | size | threads | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | 8 | pp512 | 16.50 | 61.16 | 3.707 |
| M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | 8 | pp512 | 19.09 | 57.42 | 3.008 |
| M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | 8 | pp512 | 13.32 | 46.37 | 3.481 |
| M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | 8 | pp512 | 12.30 | 48.60 | 3.951 |
| M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | 8 | pp512 | 12.11 | 49.70 | 4.104 |
| M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | 4 | tg128 | 7.73 | 11.03 | 1.427 |
| M2 Max (+fp16+dotprod) | iq2xxs | 1.73 GiB | 8 | tg128 | 14.64 | 20.09 | 1.372 |
| M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | 4 | tg128 | 8.56 | 10.72 | 1.252 |
| M2 Max (+fp16+dotprod) | iq2xs | 1.89 GiB | 8 | tg128 | 16.17 | 19.91 | 1.231 |
| M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | 4 | tg128 | 6.34 | 7.44 | 1.174 |
| M2 Max (+fp16+dotprod) | iq2m | 2.20 GiB | 8 | tg128 | 12.03 | 13.60 | 1.106 |
| M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | 4 | tg128 | 5.98 | 6.78 | 1.134 |
| M2 Max (+fp16+dotprod) | iq3xxs | 2.41 GiB | 8 | tg128 | 10.93 | 11.94 | 1.092 |
| M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | 4 | tg128 | 5.62 | 5.95 | 1.059 |
| M2 Max (+fp16+dotprod) | iq3m | 2.90 GiB | 8 | tg128 | 10.39 | 10.71 | 1.031 |
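For reference, the Speedup column is simply t/s (PR) divided by t/s (main). A minimal Python sketch reproducing the pp512 rows (values copied from the table above):

```python
# Recompute the Speedup column for the pp512 rows:
# speedup = tokens/sec with this PR / tokens/sec on the main branch.
pp512_rows = [
    # (quant type, t/s main, t/s PR)
    ("iq2xxs", 16.50, 61.16),
    ("iq2xs",  19.09, 57.42),
    ("iq2m",   13.32, 46.37),
    ("iq3xxs", 12.30, 48.60),
    ("iq3m",   12.11, 49.70),
]

for quant, main_tps, pr_tps in pp512_rows:
    print(f"{quant}: {pr_tps / main_tps:.3f}x")
```

Token-generation (tg128) speedups are much smaller because that workload is memory-bandwidth bound, so faster matrix multiplication helps less than it does for prompt processing.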