Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Another performance optimization for Zen4 + refactoring #435

Closed ikawrakow closed 1 month ago

ikawrakow commented 1 month ago

This PR adds the following changes

Table 1: PP-512 performance for a LLaMA-7B model on a Ryzen-7950X CPU

| Quantization | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|
| Q2_K_S | 152.8 | 177.8 | 1.164 |
| Q3_K_S | 165.7 | 194.7 | 1.175 |
| Q4_K_S | 160.0 | 200.0 | 1.250 |
| Q5_K_S | 147.1 | 192.5 | 1.308 |
| Q6_K | 168.4 | 195.4 | 1.160 |
| IQ4_XS | 150.6 | 193.2 | 1.283 |

Table 2: PP-512 performance for Mixtral-8x7B on a Ryzen-7950X CPU

| Quantization | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|
| Q2_K_S | 84.5 | 102.4 | 1.212 |
| Q3_K_S | 81.6 | 95.5 | 1.170 |
| Q4_K_S | 77.3 | 97.0 | 1.254 |
| Q5_K_S | 70.0 | 92.8 | 1.325 |
| Q6_K | 81.3 | 93.9 | 1.155 |
| IQ4_XS | 74.1 | 93.8 | 1.265 |

If the cost of unpacking the quantized values for the subsequent multiply-add operations with the activations is fully amortized, we would expect performance to be independent of the quantization type. I'm pleased to observe that this is now nearly the case, except for Q2_K. I'm not sure why Q2_K performance is lower for the 7B model (my guess is that the compiler fails to find the best interleaving of memory loads into SIMD registers with the SIMD operations; Q2_K is the only quant that requires a single memory load for a block of 256 weights, while all others need 2 or 3). On the other hand, the fact that Q2_K performs better than the others for Mixtral-8x7B may indicate that memory throughput plays a role even for prompt processing of long prompts.
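
To make the load/unpack pattern concrete, below is a minimal illustrative sketch (not the actual llamafile kernel; it ignores Q2_K's scales, mins, and exact bit layout) of how a hypothetical block of 256 2-bit weights, i.e. 64 bytes, can be fetched with a single 64-byte load on Zen4 and split into its four 2-bit planes before the multiply-adds with the activations:

```c
/* Illustrative sketch only: single 64-byte load of a 256-weight 2-bit block on
 * AVX-512 (Zen4), then shift/mask to peel off the four 2-bit planes. A real
 * Q2_K kernel also applies per-group scales and mins, which are omitted here. */
#include <immintrin.h>
#include <stdint.h>

static inline void unpack_256x2bit_avx512(const uint8_t *packed, __m512i out[4]) {
    const __m512i lowmask = _mm512_set1_epi8(0x03);
    const __m512i q = _mm512_loadu_si512((const void *)packed);   /* one 64-byte load */
    out[0] = _mm512_and_si512(q, lowmask);                         /* bits 1:0 of each byte */
    out[1] = _mm512_and_si512(_mm512_srli_epi16(q, 2), lowmask);   /* bits 3:2 */
    out[2] = _mm512_and_si512(_mm512_srli_epi16(q, 4), lowmask);   /* bits 5:4 */
    out[3] = _mm512_and_si512(_mm512_srli_epi16(q, 6), lowmask);   /* bits 7:6 */
}
```

In the actual matrix multiplication these registers would then feed SIMD multiply-adds against the Q8 activations; the scheduling question above is how well the compiler interleaves the single load and the shift/mask chain with those multiply-adds.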

I also did a comparison with current mainline llama.cpp (commit hash 95fb0aef) to see the combined effect of all optimizations. The following table shows the results for LLaMA-v2-7B and Mixtral-8x7B on my Ryzen-7950X CPU.

| Model | Quantization | t/s (llama.cpp) | t/s (PR) | Speedup |
|---|---|---|---|---|
| LLaMA-v2-7B | Q2_K_S | 103.8 | 177.8 | 1.713 |
| LLaMA-v2-7B | Q3_K_S | 80.1 | 194.7 | 2.430 |
| LLaMA-v2-7B | Q4_K_S | 102.4 | 200.0 | 1.953 |
| LLaMA-v2-7B | Q5_K_S | 72.8 | 192.5 | 2.643 |
| LLaMA-v2-7B | Q6_K | 79.9 | 195.4 | 2.446 |
| LLaMA-v2-7B | IQ4_XS | 72.2 | 193.2 | 2.675 |
| Mixtral-8x7B | Q2_K_S | 61.4 | 102.4 | 1.668 |
| Mixtral-8x7B | Q3_K_S | 42.6 | 95.5 | 2.240 |
| Mixtral-8x7B | Q4_K_S | 53.2 | 97.0 | 1.824 |
| Mixtral-8x7B | Q5_K_S | 38.5 | 92.8 | 2.407 |
| Mixtral-8x7B | Q6_K | 43.0 | 93.9 | 2.184 |
| Mixtral-8x7B | IQ4_XS | 38.6 | 93.8 | 2.432 |
ikawrakow commented 1 month ago

Btw, I'm noticing that this PR gives a small performance benefit for plain AVX2 as well. On a Ryzen-5975WX I measure the following for a 7B LLaMA model (an illustrative AVX2 counterpart of the earlier unpacking sketch follows the table):

| Quantization | t/s (llamafile main) | t/s (llama.cpp master) | t/s (this PR) | Speedup vs llamafile | Speedup vs llama.cpp |
|---|---|---|---|---|---|
| Q2_K_S | 204.0 | 141.0 | 209.8 | 1.029 | 1.488 |
| Q3_K_S | 195.4 | 108.5 | 208.1 | 1.065 | 1.917 |
| Q4_K_S | 188.9 | 131.3 | 203.2 | 1.075 | 1.548 |
| Q5_K_S | 173.5 | 99.4 | 193.9 | 1.117 | 1.951 |
| Q6_K | 196.2 | 95.9 | 204.8 | 1.044 | 2.136 |
| IQ4_XS | 186.9 | 105.6 | 202.4 | 1.083 | 1.917 |
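
For contrast with the Zen4 sketch above, here is the same hypothetical 2-bit unpacking on plain AVX2, again purely illustrative and not the actual kernel: the 64-byte block now takes two 32-byte loads and twice the shift/mask work, which may be part of why the gain over main is smaller here.

```c
/* Illustrative AVX2 counterpart of the earlier sketch (not the actual kernel):
 * the same 64-byte block of 2-bit weights now needs two 32-byte loads. */
#include <immintrin.h>
#include <stdint.h>

static inline void unpack_256x2bit_avx2(const uint8_t *packed, __m256i out[8]) {
    const __m256i lowmask = _mm256_set1_epi8(0x03);
    for (int half = 0; half < 2; ++half) {   /* two 32-byte loads instead of one */
        const __m256i q = _mm256_loadu_si256((const __m256i *)(packed + 32 * half));
        out[4 * half + 0] = _mm256_and_si256(q, lowmask);
        out[4 * half + 1] = _mm256_and_si256(_mm256_srli_epi16(q, 2), lowmask);
        out[4 * half + 2] = _mm256_and_si256(_mm256_srli_epi16(q, 4), lowmask);
        out[4 * half + 3] = _mm256_and_si256(_mm256_srli_epi16(q, 6), lowmask);
    }
}
```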
jart commented 1 month ago

One thing that would help illuminate benchmarks with respect to the memory latency questions is Intel's Memory Latency Checker (https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html). For example, I get these measurements with my current 512GB V-Color RAM setup.

```
jart@luna:~/llamafile$ doas mlc
Intel(R) Memory Latency Checker - v3.11
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          85.8

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      336068.2
3:1 Reads-Writes :      175669.5
2:1 Reads-Writes :      133351.9
1:1 Reads-Writes :      132625.7
Stream-triad like:      137481.7

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0
       0        141583.5

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  877.91   141653.7
 00002  877.48   141622.3
 00008  1017.20  140993.1
 00015  1233.81  140606.0
 00050  1177.21  141207.9
 00100  1112.67  141484.8
 00200  773.23   141676.5
```
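
A rough way to relate numbers like these to inference speed (a back-of-envelope sketch, not something measured in this thread): single-stream token generation has to stream the entire weight tensor from DRAM for each token, so sustained bandwidth caps tokens per second, whereas prompt processing is much more compute-bound, which is what makes the Q2_K observation above interesting. Both constants below are assumptions, roughly a Q4_K_S 7B model footprint and the 1:1 read-write figure from the mlc run:

```c
/* Back-of-envelope sketch: bound single-stream generation speed by DRAM
 * bandwidth. Both constants are assumptions, not measurements from this PR. */
#include <stdio.h>

int main(void) {
    const double model_bytes = 3.9e9;    /* assumed footprint of a Q4_K_S 7B model */
    const double bw_bytes_s  = 132.6e9;  /* ~1:1 read-write bandwidth from mlc above */
    /* Each generated token reads the full weight tensor once, so
     * bandwidth / model size bounds tokens per second. */
    printf("rough upper bound: %.1f tok/s\n", bw_bytes_s / model_bytes);
    return 0;
}
```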
jart commented 1 month ago

> I noticed that my AVX2 implementation of Q8_K quantization

Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!) so sometimes things get lost in translation. The [jart] comment markers help me avoid doing that by mistake.

ikawrakow commented 1 month ago

> > I noticed that my AVX2 implementation of Q8_K quantization
>
> Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!) so sometimes things get lost in translation. The [jart] comment markers help me avoid doing that by mistake.

I had added an AVX2 implementation for quantizing to Q8_K in the initial PR, see quantize_row_q8_K in https://github.com/Mozilla-Ocho/llamafile/pull/394/files. I did it that way because I didn't want to fool around with Georgi's single-threaded GGML_TASK_TYPE_INIT. But I actually like what you have done better. Once GGML_TASK_TYPE_INIT is multi-threaded, there is no performance benefit from vectorizing the quantization to Q8_K (I measured with and without the Q8_K AVX2 implementation and it made no measurable difference on my computer).
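
For context, the scheme being discussed is, roughly, per-block int8 quantization of the activations: find the absolute maximum of each 256-value block, derive a scale d from it, and round x/d to int8. The sketch below is only an illustration of that idea; it is not the quantize_row_q8_K from PR #394, it omits the per-group sums the real Q8_K block stores, and it uses AVX2 only for the abs-max search.

```c
/* Illustrative sketch of Q8_K-style activation quantization (not the actual
 * quantize_row_q8_K): one scale per 256-value block, abs-max found with AVX2,
 * per-group sums of the real Q8_K format omitted. */
#include <immintrin.h>
#include <math.h>
#include <stdint.h>

static void quantize_block_q8_sketch(const float *x, int8_t *q, float *d) {
    const __m256 sign_mask = _mm256_set1_ps(-0.0f);
    __m256 amax_v = _mm256_setzero_ps();
    for (int i = 0; i < 256; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        amax_v = _mm256_max_ps(amax_v, _mm256_andnot_ps(sign_mask, v));  /* |x| */
    }
    float tmp[8], amax = 0.0f;
    _mm256_storeu_ps(tmp, amax_v);
    for (int i = 0; i < 8; ++i) amax = fmaxf(amax, tmp[i]);

    *d = amax / 127.0f;                   /* largest value maps to +/-127 */
    const float id = amax > 0.0f ? 127.0f / amax : 0.0f;
    for (int i = 0; i < 256; ++i)         /* a full AVX2 version would vectorize this too */
        q[i] = (int8_t)lroundf(x[i] * id);
}
```

As noted above, once GGML_TASK_TYPE_INIT runs multi-threaded this vectorization stops being measurable, so dropping the AVX2 path is not a practical loss.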