@jart I adapted to HEAD. But to get back the Ryzen-7950X performance I had to make two separate `iqk_mul_mat` versions (one for generic AVX2 and one with AVX512F+AVX512VNNI+AVX512VL enabled).
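A minimal sketch of the two-build pattern this implies (hypothetical names, not the PR's actual wiring): each kernel lives in its own translation unit compiled with the corresponding ISA flags, and one binary picks the best kernel at run time.

```cpp
#include <cstdio>

// Hypothetical sketch, not the PR's actual code. In a real build each
// kernel would sit in a translation unit compiled with different flags
// (-mavx2 vs. -mavx512f -mavx512vnni -mavx512vl); stubs are used here
// so the example is self-contained.
static void mul_mat_avx2()   { std::puts("generic AVX2 path"); }
static void mul_mat_avx512() { std::puts("AVX512F+VNNI+VL path"); }

void mul_mat() {
    // GCC/Clang builtin that queries CPUID at run time.
    if (__builtin_cpu_supports("avx512vnni"))
        mul_mat_avx512();
    else
        mul_mat_avx2();
}

int main() { mul_mat(); }
```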
Somehow memcpy is kind of slow, so to get 4 bytes from 2-byte-aligned data it is faster to just OR together two consecutive 16-bit loads.
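A minimal sketch of that load trick, assuming a little-endian target (true on x86) and a source aligned to 2 bytes; the helper name is made up:

```cpp
#include <cstdint>

// Read 4 bytes starting at a 2-byte-aligned address without memcpy:
// two aligned 16-bit loads combined with a shift and an OR.
// Assumes little-endian byte order, which holds on x86.
static inline uint32_t load32_from_u16(const uint16_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 16);
}
```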
I'd encourage you to work your magic on Cosmopolitan's `memcpy()` function. https://github.com/jart/cosmopolitan/blob/master/libc/intrin/memmove.c You can run the tests with either `make -j32` or `make -j32 o//test/libc/intrin`.
Also, did you notice this? https://www.phoronix.com/news/Llamafile-0.8.2-More-AVX2 Congrats!
It seems some people still use the ggml legacy quants `Q4_0`, `Q4_1`, `Q5_0` and `Q5_1`, so here is a PR that improves matrix multiplication performance for these quants on AVX2. The gains for `Q4_1`, `Q5_0` and `Q5_1`, which do not have a tinyBLAS implementation, are very significant, but even `Q4_0` is faster than tinyBLAS (see table below).

I have gone for a templated implementation. This costs 2-3% in performance but reduces the code by at least a factor of 2. The implementation requires at least a C++14 compiler because I have used `auto` for the return type of two functions. Is this a problem?
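For reference, deduced return types for ordinary functions (plain `auto` without a trailing return type) are a C++14 feature; in C++11 one would have to spell the type out or use `decltype`. A minimal sketch in the spirit of the PR (hypothetical helper, not the actual code; compile with `-mavx2`):

```cpp
#include <immintrin.h>
#include <utility>

// 'auto' return type deduction without a trailing return type needs C++14.
// Split packed 4-bit quants into their low and high nibbles.
static auto unpack_nibbles(__m256i bits) {
    const __m256i low_mask = _mm256_set1_epi8(0x0F);
    __m256i lo = _mm256_and_si256(bits, low_mask);
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(bits, 4), low_mask);
    return std::make_pair(lo, hi);  // deduced as std::pair<__m256i, __m256i>
}
```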
Prompt processing speed for a 512-token prompt (PP-512) for a 7B LLaMA model:

The PR can also help with token generation (TG) speed. On my system TG is fully memory bound beyond 4-8 threads (depending on quantization type), so to better illustrate the performance differences, here are the TG-128 results with just 2 threads on a Ryzen-7950X for a 7B LLaMA model: