Btw, I'm noticing that this PR results in a small performance benefit for plain AVX2 as well. On a Ryzen-5975WX I measure the following for a 7B LLaMA model:
| Quantization | t/s (llamafile main) | t/s (llama.cpp master) | t/s (this PR) | Speedup vs llamafile | Speedup vs llama.cpp |
|---|---|---|---|---|---|
| Q2_K_S | 204.0 | 141.0 | 209.8 | 1.029 | 1.488 |
| Q3_K_S | 195.4 | 108.5 | 208.1 | 1.065 | 1.917 |
| Q4_K_S | 188.9 | 131.3 | 203.2 | 1.075 | 1.548 |
| Q5_K_S | 173.5 | 99.4 | 193.9 | 1.117 | 1.951 |
| Q6_K | 196.2 | 95.9 | 204.8 | 1.044 | 2.136 |
| IQ4_XS | 186.9 | 105.6 | 202.4 | 1.083 | 1.917 |
One thing that would help illuminate the memory-latency questions these benchmarks raise is Intel's Memory Latency Checker: https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html For example, I get these measurements with my current 512 GB V-COLOR RAM setup:
```
jart@luna:~/llamafile$ doas mlc
Intel(R) Memory Latency Checker - v3.11
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          85.8

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  336068.2
3:1 Reads-Writes :  175669.5
2:1 Reads-Writes :  133351.9
1:1 Reads-Writes :  132625.7
Stream-triad like:  137481.7

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0
       0        141583.5

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject   Latency  Bandwidth
Delay    (ns)     MB/sec
==========================
 00000    877.91  141653.7
 00002    877.48  141622.3
 00008   1017.20  140993.1
 00015   1233.81  140606.0
 00050   1177.21  141207.9
 00100   1112.67  141484.8
 00200    773.23  141676.5
```
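To tie these numbers back to the benchmarks (a back-of-envelope under stated assumptions, not a measurement): single-stream token generation has to stream the entire weight file once per token, so the ~336 GB/s all-read peak against a 7B model at `Q4_K_S` (roughly 3.9 GB, an assumed file size) bounds generation at about 336 / 3.9 ≈ 86 t/s on this machine. Prompt processing amortizes each weight read across the whole batch of tokens, which is why the PP numbers in the table above can sit well beyond that bound.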
> I noticed that my AVX2 implementation of Q8_K quantization

Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!), so sometimes things get lost in translation. The `[jart]` comment markers help me avoid doing that by mistake.
I had added an AVX2 implementation for quantizing to `Q8_K` in the initial PR; see `quantize_row_q8_K` in https://github.com/Mozilla-Ocho/llamafile/pull/394/files. I did it that way because I didn't want to fool around with Georgi's single-threaded `GGML_TASK_TYPE_INIT`. But I actually like what you have done better. Once `GGML_TASK_TYPE_INIT` is multi-threaded, there is no performance benefit from vectorizing the quantization to `Q8_K` (I measured with and without the `Q8_K` AVX2 implementation and it made no measurable difference on my computer).
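For readers following along, here is a minimal sketch of what such a vectorized quantization path can look like. This is an illustration only, not the PR's `quantize_row_q8_K`: the function name, the 256-float block with a single float scale, and the omission of the per-group block sums the real `Q8_K` format carries are all simplifying assumptions.

```c
// Sketch: AVX2 quantization of a 256-float block to int8 with one scale.
// Simplified relative to the real Q8_K format (no per-group block sums).
#include <immintrin.h>
#include <stdint.h>

static void quantize_block_q8_sketch(const float *x, int8_t *q, float *scale) {
    // Pass 1: max |x| over the block, 8 floats per iteration.
    const __m256 sign_mask = _mm256_set1_ps(-0.0f);
    __m256 max_abs = _mm256_setzero_ps();
    for (int i = 0; i < 256; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        max_abs = _mm256_max_ps(max_abs, _mm256_andnot_ps(sign_mask, v));
    }
    // Horizontal max of the 8 lanes.
    __m128 m = _mm_max_ps(_mm256_castps256_ps128(max_abs),
                          _mm256_extractf128_ps(max_abs, 1));
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));
    m = _mm_max_ss(m, _mm_movehdup_ps(m));
    const float amax = _mm_cvtss_f32(m);

    const float d  = amax / 127.0f;           // one scale per block
    const float id = d > 0.0f ? 1.0f / d : 0.0f;
    *scale = d;

    // Pass 2: scale, round to nearest, saturate, narrow int32 -> int8.
    const __m256 vid = _mm256_set1_ps(id);
    for (int i = 0; i < 256; i += 8) {
        __m256  v  = _mm256_mul_ps(_mm256_loadu_ps(x + i), vid);
        __m256i vi = _mm256_cvtps_epi32(
            _mm256_round_ps(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
        __m128i w  = _mm_packs_epi32(_mm256_castsi256_si128(vi),
                                     _mm256_extracti128_si256(vi, 1));
        _mm_storel_epi64((__m128i *)(q + i), _mm_packs_epi16(w, w));
    }
}
```

As the thread notes, once the quantization step runs on all threads, scalar code keeps the cores fed just as well, so a kernel like this stops paying for itself.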
This PR adds the following changes: improved matrix multiplication performance when AVX512 extensions are available (more precisely, when `AVX512F`, `AVX512VNNI`, `AVX512VL`, `AVX512BW`, and `AVX512DQ` are available). Improvements are in the 15-30% range on my Ryzen-7950X CPU (see Table 1 and Table 2 below; a minimal sketch of the VNNI instruction these kernels lean on follows the captions).

**Table 1** PP-512 performance for a LLaMA-7B model on a Ryzen-7950X CPU

**Table 2** PP-512 performance for Mixtral-8x7B on a Ryzen-7950X CPU
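As context for why `AVX512VNNI` is on that list, a minimal hedged sketch (my illustration, not the PR's kernel): `_mm512_dpbusd_epi32` multiplies unsigned bytes by signed bytes and accumulates into 32-bit lanes in one instruction, where an AVX2 path needs a `_mm256_maddubs_epi16`/`_mm256_madd_epi16` pair. The unsigned-times-signed pairing matches the usual arrangement of unpacked unsigned weight values against signed `Q8_K` activations.

```c
// Sketch: dot product of n bytes (n a multiple of 64) using AVX512VNNI.
// u is treated as unsigned 8-bit, s as signed 8-bit.
#include <immintrin.h>
#include <stdint.h>

static int32_t dot_u8s8_vnni(const uint8_t *u, const int8_t *s, int n) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < n; i += 64) {
        __m512i vu = _mm512_loadu_si512(u + i);
        __m512i vs = _mm512_loadu_si512(s + i);
        acc = _mm512_dpbusd_epi32(acc, vu, vs);  // u8*s8 -> i32, fused
    }
    return _mm512_reduce_add_epi32(acc);         // horizontal sum
}
```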
If the cost associated with unpacking the quantized values for subsequent multiply-add operations with the activations were fully amortized, we would expect performance to be independent of the quantization type. Hence, I'm now pleased to observe that this is nearly the case, with the exception of `Q2_K`. I'm not sure why `Q2_K` performance is lower for the 7B model (my guess is that the compiler fails to find the best ordering of memory loads into SIMD registers and SIMD operations; `Q2_K` is the only quant that requires a single memory load for a block of 256 weights, while all others need 2 or 3). But the fact that `Q2_K` performs better than the others for Mixtral-8x7B may indicate that memory throughput plays a role even for prompt processing of long prompts.

I also did a comparison with current mainline `llama.cpp` (commit hash `95fb0aef`) to see the combined effect of all optimizations. The following table shows the results for LLaMA-v2-7B and Mixtral-8x7B on my Ryzen-7950X CPU.