Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Faster AVX2 matrix multiplications for legacy quants #405

Closed · ikawrakow closed this 1 month ago

ikawrakow commented 1 month ago

It seems some people still use the ggml legacy quants Q4_0, Q4_1, Q5_0 and Q5_1, so here is a PR that improves matrix multiplication performance for these quants on AVX2. The gains for Q4_1, Q5_0 and Q5_1, which do not have a tinyBLAS implementation, are very significant, but even Q4_0 ends up faster than tinyBLAS (see table below).
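For context, each of these legacy quants stores 32 weights per block, packing 4-bit quants (plus a separate 5th bit for Q5_0/Q5_1) together with fp16 scale/min values. A sketch of the block layouts, paraphrased from ggml's `ggml-common.h`:

```c++
// Block layouts of the ggml legacy quants, paraphrased from ggml-common.h.
// Each block holds 32 weights. Dequantization is d*(q-8) for Q4_0,
// d*q + m for Q4_1, d*(q-16) for Q5_0, and d*q + m for Q5_1.
#include <cstdint>

typedef uint16_t ggml_half;  // raw IEEE fp16 bits

struct block_q4_0 {
    ggml_half d;        // scale
    uint8_t   qs[16];   // 32 x 4-bit quants, two per byte
};

struct block_q4_1 {
    ggml_half d;        // scale
    ggml_half m;        // minimum
    uint8_t   qs[16];   // 32 x 4-bit quants, two per byte
};

struct block_q5_0 {
    ggml_half d;        // scale
    uint8_t   qh[4];    // 5th bit of each of the 32 quants
    uint8_t   qs[16];   // low 4 bits, two per byte
};

struct block_q5_1 {
    ggml_half d;        // scale
    ggml_half m;        // minimum
    uint8_t   qh[4];    // 5th bit of each of the 32 quants
    uint8_t   qs[16];   // low 4 bits, two per byte
};
```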

I have gone for a templated implementation. This costs 2-3% in performance but reduces the code size by at least a factor of 2. The implementation requires at least a C++14 compiler because I have used auto for the return type of two functions. Is this a problem?
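For readers unfamiliar with the feature: C++14 return type deduction lets a template return a type that depends on its template arguments without spelling that type out. A minimal illustration of the pattern (the names are hypothetical, not the PR's actual code):

```c++
// Minimal sketch of C++14 deduced ('auto') return types. The return type
// differs per instantiation and, being a local class, cannot even be named
// in a C++11-style signature. Compile with -std=c++14 -mavx.
#include <immintrin.h>

template <int nrc>  // number of result columns handled per call
static inline auto make_accumulators() {
    struct Acc { __m256 v[nrc]; };  // type depends on the template argument
    Acc acc;
    for (int i = 0; i < nrc; ++i) acc.v[i] = _mm256_setzero_ps();
    return acc;                     // C++14 deduces 'Acc' here
}

// Usage: each instantiation gets its own accumulator type.
// auto acc4 = make_accumulators<4>();
// auto acc8 = make_accumulators<8>();
```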

Prompt processing speed, in tokens per second, for a 512-token prompt (PP-512) with a 7B LLaMA model:

| CPU          | Quant | PP-512 (master) | PP-512 (PR) | Speedup |
|--------------|-------|-----------------|-------------|---------|
| Ryzen-7950X  | Q4_0  | 114.5           | 130.6       | 1.141   |
| Ryzen-7950X  | Q4_1  | 66.0            | 138.0       | 2.091   |
| Ryzen-7950X  | Q5_0  | 55.8            | 126.4       | 2.265   |
| Ryzen-7950X  | Q5_1  | 54.0            | 126.4       | 2.341   |
| Ryzen-5975WX | Q4_0  | 120.2           | 161.0       | 1.339   |
| Ryzen-5975WX | Q4_1  | 91.3            | 166.8       | 1.827   |
| Ryzen-5975WX | Q5_0  | 83.4            | 155.6       | 1.866   |
| Ryzen-5975WX | Q5_1  | 77.8            | 162.0       | 2.083   |

The PR also helps with token generation (TG) speed. On my system TG becomes fully memory bound beyond 4-8 threads (depending on quantization type), so to better illustrate the performance differences, here are TG-128 results, in tokens per second, with just 2 threads on a Ryzen-7950X for a 7B LLaMA model:

| CPU         | Quant | TG-128 (master) | TG-128 (PR) | Speedup |
|-------------|-------|-----------------|-------------|---------|
| Ryzen-7950X | Q4_0  | 4.39            | 10.86       | 2.474   |
| Ryzen-7950X | Q4_1  | 5.69            | 11.49       | 2.019   |
| Ryzen-7950X | Q5_0  | 6.00            | 9.00        | 1.500   |
| Ryzen-7950X | Q5_1  | 4.67            | 8.79        | 1.882   |
ikawrakow commented 1 month ago

@jart I adapted the PR to head. But to get back the Ryzen-7950X performance I had to make two separate iqk_mul_mat versions (one for generic AVX2 and one with AVX512F+AVX512VNNI+AVX512VL enabled).
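A hedged sketch of that two-build dispatch pattern: compile the same kernel twice with different target flags, probe the CPU once, and route every call. The signature below is illustrative; only the iqk_mul_mat name comes from this thread.

```c++
// Defined in a translation unit compiled with -mavx2 -mfma:
bool iqk_mul_mat_avx2(long m, long n, long k,
                      int typeA, const void *A, const void *B,
                      float *C, long ldc, int ith, int nth);

// Defined in a translation unit compiled with
// -mavx512f -mavx512vnni -mavx512vl:
bool iqk_mul_mat_avx512(long m, long n, long k,
                        int typeA, const void *A, const void *B,
                        float *C, long ldc, int ith, int nth);

static bool cpu_has_avx512() {
    // GCC/Clang builtin; checks CPUID feature bits at runtime.
    return __builtin_cpu_supports("avx512f")
        && __builtin_cpu_supports("avx512vnni")
        && __builtin_cpu_supports("avx512vl");
}

bool iqk_mul_mat(long m, long n, long k,
                 int typeA, const void *A, const void *B,
                 float *C, long ldc, int ith, int nth) {
    static const bool avx512 = cpu_has_avx512();  // probed on first call only
    return avx512
        ? iqk_mul_mat_avx512(m, n, k, typeA, A, B, C, ldc, ith, nth)
        : iqk_mul_mat_avx2  (m, n, k, typeA, A, B, C, ldc, ith, nth);
}
```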

jart commented 1 month ago

> Somehow memcpy is kind of slow, so for getting 4 bytes from 2-byte-aligned data it is faster to just OR together two consecutive 16-bit entries.
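A sketch of the two approaches being compared, assuming little-endian x86 and data aligned to 2 bytes (the function names are mine, not from the PR):

```c++
#include <cstdint>
#include <cstring>

// The portable idiom that proved slow here: let memcpy assemble the word.
static inline uint32_t load32_via_memcpy(const uint16_t *p) {
    uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// The trick described above: OR together two adjacent 16-bit entries.
// On little-endian targets, p[0] is the low half and p[1] the high half.
static inline uint32_t load32_via_or(const uint16_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 16);
}
```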

I'd encourage you to work your magic on Cosmopolitan's `memcpy()` function: https://github.com/jart/cosmopolitan/blob/master/libc/intrin/memmove.c You can run the tests with either `make -j32` or `make -j32 o//test/libc/intrin`.

jart commented 1 month ago

Also, did you notice this? https://www.phoronix.com/news/Llamafile-0.8.2-More-AVX2 Congrats!