Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Faster AVX2 matrix multiplications for legacy quants #405

Closed · ikawrakow closed this 1 month ago

ikawrakow commented 1 month ago

It seems some people still use the ggml legacy quants Q4_0, Q4_1, Q5_0 and Q5_1, so here is a PR that improves matrix multiplication performance for these quants on AVX2. The gains for Q4_1, Q5_0 and Q5_1, which do not have a tinyBLAS implementation, are very significant, but even Q4_0 ends up faster than tinyBLAS (see table below).
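For context, each of these legacy quants stores 32 weights per block, packing 4-bit quants (plus a separate 5th bit for Q5_0/Q5_1) together with fp16 scale/min values. A sketch of the block layouts, paraphrased from ggml's `ggml-common.h`:

```c++
// Block layouts of the ggml legacy quants, paraphrased from ggml-common.h.
// Each block holds 32 weights. Dequantization is d*(q-8) for Q4_0,
// d*q + m for Q4_1, d*(q-16) for Q5_0, and d*q + m for Q5_1.
#include <cstdint>

typedef uint16_t ggml_half;  // raw IEEE fp16 bits

struct block_q4_0 {
    ggml_half d;        // scale
    uint8_t   qs[16];   // 32 x 4-bit quants, two per byte
};

struct block_q4_1 {
    ggml_half d;        // scale
    ggml_half m;        // minimum
    uint8_t   qs[16];   // 32 x 4-bit quants, two per byte
};

struct block_q5_0 {
    ggml_half d;        // scale
    uint8_t   qh[4];    // 5th bit of each of the 32 quants
    uint8_t   qs[16];   // low 4 bits, two per byte
};

struct block_q5_1 {
    ggml_half d;        // scale
    ggml_half m;        // minimum
    uint8_t   qh[4];    // 5th bit of each of the 32 quants
    uint8_t   qs[16];   // low 4 bits, two per byte
};
```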

I have gone for a templated implementation. This costs 2-3% in performance but reduces the code size by at least a factor of 2. The implementation requires at least a C++14 compiler because I have used auto for the return type of two functions. Is this a problem?
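For readers unfamiliar with the feature: C++14 return type deduction lets a template return a type that depends on its template arguments without spelling that type out. A minimal illustration of the pattern (the names are hypothetical, not the PR's actual code):

```c++
// Minimal sketch of C++14 deduced ('auto') return types. The return type
// differs per instantiation and, being a local class, cannot even be named
// in a C++11-style signature. Compile with -std=c++14 -mavx.
#include <immintrin.h>

template <int nrc>  // number of result columns handled per call
static inline auto make_accumulators() {
    struct Acc { __m256 v[nrc]; };  // type depends on the template argument
    Acc acc;
    for (int i = 0; i < nrc; ++i) acc.v[i] = _mm256_setzero_ps();
    return acc;                     // C++14 deduces 'Acc' here
}

// Usage: each instantiation gets its own accumulator type.
// auto acc4 = make_accumulators<4>();
// auto acc8 = make_accumulators<8>();
```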

Prompt processing speed, in tokens per second, for a 512-token prompt (PP-512) with a 7B LLaMA model:

| CPU          | Quant | PP-512 (master) | PP-512 (PR) | Speedup |
|--------------|-------|-----------------|-------------|---------|
| Ryzen-7950X  | Q4_0  | 114.5           | 130.6       | 1.141   |
| Ryzen-7950X  | Q4_1  | 66.0            | 138.0       | 2.091   |
| Ryzen-7950X  | Q5_0  | 55.8            | 126.4       | 2.265   |
| Ryzen-7950X  | Q5_1  | 54.0            | 126.4       | 2.341   |
| Ryzen-5975WX | Q4_0  | 120.2           | 161.0       | 1.339   |
| Ryzen-5975WX | Q4_1  | 91.3            | 166.8       | 1.827   |
| Ryzen-5975WX | Q5_0  | 83.4            | 155.6       | 1.866   |
| Ryzen-5975WX | Q5_1  | 77.8            | 162.0       | 2.083   |

The PR also helps with token generation (TG) speed. On my system TG becomes fully memory bound beyond 4-8 threads (depending on quantization type), so to better illustrate the performance differences, here are TG-128 results, in tokens per second, with just 2 threads on a Ryzen-7950X for a 7B LLaMA model:

| CPU         | Quant | TG-128 (master) | TG-128 (PR) | Speedup |
|-------------|-------|-----------------|-------------|---------|
| Ryzen-7950X | Q4_0  | 4.39            | 10.86       | 2.474   |
| Ryzen-7950X | Q4_1  | 5.69            | 11.49       | 2.019   |
| Ryzen-7950X | Q5_0  | 6.00            | 9.00        | 1.500   |
| Ryzen-7950X | Q5_1  | 4.67            | 8.79        | 1.882   |
ikawrakow commented 1 month ago

@jart I adapted the PR to head. But to get back the Ryzen-7950X performance I had to make two separate iqk_mul_mat versions (one for generic AVX2 and one with AVX512F+AVX512VNNI+AVX512VL enabled).
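A hedged sketch of that two-build dispatch pattern: compile the same kernel twice with different target flags, probe the CPU once, and route every call. The signature below is illustrative; only the iqk_mul_mat name comes from this thread.

```c++
// Defined in a translation unit compiled with -mavx2 -mfma:
bool iqk_mul_mat_avx2(long m, long n, long k,
                      int typeA, const void *A, const void *B,
                      float *C, long ldc, int ith, int nth);

// Defined in a translation unit compiled with
// -mavx512f -mavx512vnni -mavx512vl:
bool iqk_mul_mat_avx512(long m, long n, long k,
                        int typeA, const void *A, const void *B,
                        float *C, long ldc, int ith, int nth);

static bool cpu_has_avx512() {
    // GCC/Clang builtin; checks CPUID feature bits at runtime.
    return __builtin_cpu_supports("avx512f")
        && __builtin_cpu_supports("avx512vnni")
        && __builtin_cpu_supports("avx512vl");
}

bool iqk_mul_mat(long m, long n, long k,
                 int typeA, const void *A, const void *B,
                 float *C, long ldc, int ith, int nth) {
    static const bool avx512 = cpu_has_avx512();  // probed on first call only
    return avx512
        ? iqk_mul_mat_avx512(m, n, k, typeA, A, B, C, ldc, ith, nth)
        : iqk_mul_mat_avx2  (m, n, k, typeA, A, B, C, ldc, ith, nth);
}
```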

jart commented 1 month ago

> Somehow memcpy is kind of slow, so for getting 4 bytes from 2-byte-aligned data it is faster to just OR together two consecutive 16-bit entries.
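A sketch of the two approaches being compared, assuming little-endian x86 and data aligned to 2 bytes (the function names are mine, not from the PR):

```c++
#include <cstdint>
#include <cstring>

// The portable idiom that proved slow here: let memcpy assemble the word.
static inline uint32_t load32_via_memcpy(const uint16_t *p) {
    uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// The trick described above: OR together two adjacent 16-bit entries.
// On little-endian targets, p[0] is the low half and p[1] the high half.
static inline uint32_t load32_via_or(const uint16_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 16);
}
```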

I'd encourage you to work your magic on Cosmopolitan's `memcpy()` function: https://github.com/jart/cosmopolitan/blob/master/libc/intrin/memmove.c You can run the tests with either `make -j32` or `make -j32 o//test/libc/intrin`.

jart commented 1 month ago

Also, did you notice this? https://www.phoronix.com/news/Llamafile-0.8.2-More-AVX2 Congrats!