Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Another performance optimization for Zen4 + refactoring #435

Closed ikawrakow closed 1 month ago

ikawrakow commented 1 month ago

This PR adds the following changes

Table 1: PP-512 performance for a LLaMA-7B model on a Ryzen-7950X CPU

| Quantization | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|
| Q2_K_S | 152.8 | 177.8 | 1.164 |
| Q3_K_S | 165.7 | 194.7 | 1.175 |
| Q4_K_S | 160.0 | 200.0 | 1.250 |
| Q5_K_S | 147.1 | 192.5 | 1.308 |
| Q6_K | 168.4 | 195.4 | 1.160 |
| IQ4_XS | 150.6 | 193.2 | 1.283 |

Table 2: PP-512 performance for Mixtral-8x7B on a Ryzen-7950X CPU

| Quantization | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|
| Q2_K_S | 84.5 | 102.4 | 1.212 |
| Q3_K_S | 81.6 | 95.5 | 1.170 |
| Q4_K_S | 77.3 | 97.0 | 1.254 |
| Q5_K_S | 70.0 | 92.8 | 1.325 |
| Q6_K | 81.3 | 93.9 | 1.155 |
| IQ4_XS | 74.1 | 93.8 | 1.265 |

If the cost of unpacking the quantized values for the subsequent multiply-add operations with the activations is fully amortized, we would expect performance to be independent of the quantization type. I'm pleased to observe that this is now nearly the case, except for Q2_K. I'm not sure why Q2_K performance is lower for the 7B model (my guess is that the compiler fails to find the best interleaving of memory loads into SIMD registers with the SIMD operations; Q2_K is the only quant that requires a single memory load for a block of 256 weights, while all others need 2 or 3). On the other hand, the fact that Q2_K performs better than the others for Mixtral-8x7B may indicate that memory throughput plays a role even for prompt processing of long prompts.
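
To make the load/unpack pattern concrete, below is a minimal illustrative sketch (not the actual llamafile kernel; it ignores Q2_K's scales, mins, and exact bit layout) of how a hypothetical block of 256 2-bit weights, i.e. 64 bytes, can be fetched with a single 64-byte load on Zen4 and split into its four 2-bit planes before the multiply-adds with the activations:

```c
/* Illustrative sketch only: single 64-byte load of a 256-weight 2-bit block on
 * AVX-512 (Zen4), then shift/mask to peel off the four 2-bit planes. A real
 * Q2_K kernel also applies per-group scales and mins, which are omitted here. */
#include <immintrin.h>
#include <stdint.h>

static inline void unpack_256x2bit_avx512(const uint8_t *packed, __m512i out[4]) {
    const __m512i lowmask = _mm512_set1_epi8(0x03);
    const __m512i q = _mm512_loadu_si512((const void *)packed);   /* one 64-byte load */
    out[0] = _mm512_and_si512(q, lowmask);                         /* bits 1:0 of each byte */
    out[1] = _mm512_and_si512(_mm512_srli_epi16(q, 2), lowmask);   /* bits 3:2 */
    out[2] = _mm512_and_si512(_mm512_srli_epi16(q, 4), lowmask);   /* bits 5:4 */
    out[3] = _mm512_and_si512(_mm512_srli_epi16(q, 6), lowmask);   /* bits 7:6 */
}
```

In the actual matrix multiplication these registers would then feed SIMD multiply-adds against the Q8 activations; the scheduling question above is how well the compiler interleaves the single load and the shift/mask chain with those multiply-adds.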

I also did a comparison with current mainline llama.cpp (commit hash 95fb0aef) to see the combined effect of all optimizations. The following table shows the results for LLaMA-v2-7B and Mixtral-8x7B on my Ryzen-7950X CPU.

| Model | Quantization | t/s (llama.cpp) | t/s (PR) | Speedup |
|---|---|---|---|---|
| LLaMA-v2-7B | Q2_K_S | 103.8 | 177.8 | 1.713 |
| LLaMA-v2-7B | Q3_K_S | 80.1 | 194.7 | 2.430 |
| LLaMA-v2-7B | Q4_K_S | 102.4 | 200.0 | 1.953 |
| LLaMA-v2-7B | Q5_K_S | 72.8 | 192.5 | 2.643 |
| LLaMA-v2-7B | Q6_K | 79.9 | 195.4 | 2.446 |
| LLaMA-v2-7B | IQ4_XS | 72.2 | 193.2 | 2.675 |
| Mixtral-8x7B | Q2_K_S | 61.4 | 102.4 | 1.668 |
| Mixtral-8x7B | Q3_K_S | 42.6 | 95.5 | 2.240 |
| Mixtral-8x7B | Q4_K_S | 53.2 | 97.0 | 1.824 |
| Mixtral-8x7B | Q5_K_S | 38.5 | 92.8 | 2.407 |
| Mixtral-8x7B | Q6_K | 43.0 | 93.9 | 2.184 |
| Mixtral-8x7B | IQ4_XS | 38.6 | 93.8 | 2.432 |
ikawrakow commented 1 month ago

Btw, I'm noticing that this PR gives a small performance benefit for plain AVX2 as well. On a Ryzen-5975WX I measure the following for a 7B LLaMA model (an illustrative AVX2 counterpart of the earlier unpacking sketch follows the table):

| Quantization | t/s (llamafile main) | t/s (llama.cpp master) | t/s (this PR) | Speedup vs llamafile | Speedup vs llama.cpp |
|---|---|---|---|---|---|
| Q2_K_S | 204.0 | 141.0 | 209.8 | 1.029 | 1.488 |
| Q3_K_S | 195.4 | 108.5 | 208.1 | 1.065 | 1.917 |
| Q4_K_S | 188.9 | 131.3 | 203.2 | 1.075 | 1.548 |
| Q5_K_S | 173.5 | 99.4 | 193.9 | 1.117 | 1.951 |
| Q6_K | 196.2 | 95.9 | 204.8 | 1.044 | 2.136 |
| IQ4_XS | 186.9 | 105.6 | 202.4 | 1.083 | 1.917 |
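
For contrast with the Zen4 sketch above, here is the same hypothetical 2-bit unpacking on plain AVX2, again purely illustrative and not the actual kernel: the 64-byte block now takes two 32-byte loads and twice the shift/mask work, which may be part of why the gain over main is smaller here.

```c
/* Illustrative AVX2 counterpart of the earlier sketch (not the actual kernel):
 * the same 64-byte block of 2-bit weights now needs two 32-byte loads. */
#include <immintrin.h>
#include <stdint.h>

static inline void unpack_256x2bit_avx2(const uint8_t *packed, __m256i out[8]) {
    const __m256i lowmask = _mm256_set1_epi8(0x03);
    for (int half = 0; half < 2; ++half) {   /* two 32-byte loads instead of one */
        const __m256i q = _mm256_loadu_si256((const __m256i *)(packed + 32 * half));
        out[4 * half + 0] = _mm256_and_si256(q, lowmask);
        out[4 * half + 1] = _mm256_and_si256(_mm256_srli_epi16(q, 2), lowmask);
        out[4 * half + 2] = _mm256_and_si256(_mm256_srli_epi16(q, 4), lowmask);
        out[4 * half + 3] = _mm256_and_si256(_mm256_srli_epi16(q, 6), lowmask);
    }
}
```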
jart commented 1 month ago

One thing that would help illuminate benchmarks with respect to the memory latency questions is Intel's Memory Latency Checker (https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html). For example, I get these measurements with my current 512GB V-Color RAM setup.

```
jart@luna:~/llamafile$ doas mlc
Intel(R) Memory Latency Checker - v3.11
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          85.8

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      336068.2
3:1 Reads-Writes :      175669.5
2:1 Reads-Writes :      133351.9
1:1 Reads-Writes :      132625.7
Stream-triad like:      137481.7

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0
       0        141583.5

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  877.91   141653.7
 00002  877.48   141622.3
 00008  1017.20  140993.1
 00015  1233.81  140606.0
 00050  1177.21  141207.9
 00100  1112.67  141484.8
 00200  773.23   141676.5
```
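
A rough way to relate numbers like these to inference speed (a back-of-envelope sketch, not something measured in this thread): single-stream token generation has to stream the entire weight tensor from DRAM for each token, so sustained bandwidth caps tokens per second, whereas prompt processing is much more compute-bound, which is what makes the Q2_K observation above interesting. Both constants below are assumptions, roughly a Q4_K_S 7B model footprint and the 1:1 read-write figure from the mlc run:

```c
/* Back-of-envelope sketch: bound single-stream generation speed by DRAM
 * bandwidth. Both constants are assumptions, not measurements from this PR. */
#include <stdio.h>

int main(void) {
    const double model_bytes = 3.9e9;    /* assumed footprint of a Q4_K_S 7B model */
    const double bw_bytes_s  = 132.6e9;  /* ~1:1 read-write bandwidth from mlc above */
    /* Each generated token reads the full weight tensor once, so
     * bandwidth / model size bounds tokens per second. */
    printf("rough upper bound: %.1f tok/s\n", bw_bytes_s / model_bytes);
    return 0;
}
```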
jart commented 1 month ago

> I noticed that my AVX2 implementation of Q8_K quantization

Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!) so sometimes things get lost in translation. The [jart] comment markers help me avoid doing that by mistake.

ikawrakow commented 1 month ago

> > I noticed that my AVX2 implementation of Q8_K quantization
>
> Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!) so sometimes things get lost in translation. The [jart] comment markers help me avoid doing that by mistake.

I had added an AVX2 implementation for quantizing to Q8_K in the initial PR, see quantize_row_q8_K in https://github.com/Mozilla-Ocho/llamafile/pull/394/files. I did it that way because I didn't want to fool around with Georgi's single-threaded GGML_TASK_TYPE_INIT. But I actually like what you have done better. Once GGML_TASK_TYPE_INIT is multi-threaded, there is no performance benefit from vectorizing the quantization to Q8_K (I measured with and without the Q8_K AVX2 implementation and it made no measurable difference on my computer).
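
For context, the scheme being discussed is, roughly, per-block int8 quantization of the activations: find the absolute maximum of each 256-value block, derive a scale d from it, and round x/d to int8. The sketch below is only an illustration of that idea; it is not the quantize_row_q8_K from PR #394, it omits the per-group sums the real Q8_K block stores, and it uses AVX2 only for the abs-max search.

```c
/* Illustrative sketch of Q8_K-style activation quantization (not the actual
 * quantize_row_q8_K): one scale per 256-value block, abs-max found with AVX2,
 * per-group sums of the real Q8_K format omitted. */
#include <immintrin.h>
#include <math.h>
#include <stdint.h>

static void quantize_block_q8_sketch(const float *x, int8_t *q, float *d) {
    const __m256 sign_mask = _mm256_set1_ps(-0.0f);
    __m256 amax_v = _mm256_setzero_ps();
    for (int i = 0; i < 256; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        amax_v = _mm256_max_ps(amax_v, _mm256_andnot_ps(sign_mask, v));  /* |x| */
    }
    float tmp[8], amax = 0.0f;
    _mm256_storeu_ps(tmp, amax_v);
    for (int i = 0; i < 8; ++i) amax = fmaxf(amax, tmp[i]);

    *d = amax / 127.0f;                   /* largest value maps to +/-127 */
    const float id = amax > 0.0f ? 127.0f / amax : 0.0f;
    for (int i = 0; i < 256; ++i)         /* a full AVX2 version would vectorize this too */
        q[i] = (int8_t)lroundf(x[i] * id);
}
```

As noted above, once GGML_TASK_TYPE_INIT runs multi-threaded this vectorization stops being measurable, so dropping the AVX2 path is not a practical loss.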