AndreasKunar opened 5 days ago
The problem seems to be that Q4_0 is being converted to Q4_0_4_8, which performs badly on this CPU. If patched so that it is converted to Q4_0_4_4 instead, performance is as expected, e.g.:
```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 96a16dfb..ff27f9cb 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3549,7 +3549,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
         return GGML_TYPE_Q4_0_8_8;
     }
     if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-        return GGML_TYPE_Q4_0_4_8;
+        //return GGML_TYPE_Q4_0_4_8;
     }
     if (ggml_cpu_has_neon()) {
         return GGML_TYPE_Q4_0_4_4;
```
@chaxu01 does the logic for choosing the type to repack to need to be updated?
I can confirm that on Apple, `ggml_arm_arch_features.has_i8mm` ends up being true, but the `defined(__ARM_FEATURE_MATMUL_INT8)` macro is false, so the logic decides to repack, but then does not use the optimized path and falls back to the slow scalar implementation.
@AndreasKunar : If you build with cmake, try adding something like `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod` to the build command. This causes `__ARM_FEATURE_MATMUL_INT8` to be defined.
@AndreasKunar I would like to confirm whether you built with the arch=...+i8mm option. Due to runtime CPU feature detection, we decided to repack into Q4_0_4_8 as it is supported by the CPU. However, when executing the kernel, the reference implementation is used because the optimized kernel was not included in the build.
@eddnjjn + @chaxu01: I normally do not build with cmake; as in my initial report, I built with "make -j llama-bench" and no switches. This also follows the llama performance instructions (discussion #4167), which are essential and still current.
I tried a cmake build with your -march flags, and it does not degrade performance:

```shell
cmake -DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod -B build
cmake --build build --config Release --target llama-bench
./build/bin/llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...
```
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 56.64 ± 1.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 13.76 ± 0.15 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 59.08 ± 1.13 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 12.86 ± 0.94 |
build: 1bb30bf2 (4149)
I understand that the problem is due to the repack-on-load code (which is great, because it avoids having to manually convert Q4_0 to Q4_0_4_4 or _8). It's a great idea, thanks, and it also helps a lot on Macs with containers/VMs. This is why I initially included the Q4_0_4_4 performance numbers (without repack).
But if special build flags, or using cmake instead of make, are now needed (especially on very common hardware like Apple silicon Macs), I think this needs to be documented thoroughly in the llama.cpp build instructions and also in the Mac performance comparison build instructions (discussion #4167),...
I also understand that in the future the build might best be standardized on cmake, with make deprecated (see the ongoing discussion in issue #10268). But until now, and as documented in the build instructions, "make" has always worked nicely for common hardware+OS combinations. If it no longer works, that is a severe bug and needs to be documented thoroughly!
P.S.: please tell me if I can support or help you. I'm currently on vacation and a bit slow in responding, apologies.
@AndreasKunar Thanks for the verification. I agree that the llama.cpp build instructions could be updated to better reflect the runtime repack case.
At least the `ggml_cpu_has_xxx` functions should also check whether the build supports the feature and return false otherwise. If the ARM features are not automatically enabled with `-march=native`, it may also be good to enable them by default, since they are checked at runtime anyway.
I will reopen this since the issue with the ARM features not being automatically enabled with the default native builds is still not addressed.
@slaren One option for fixing this could be to update the CMake script to conditionally add the appropriate compile options based on the detected CPU capabilities. This would ensure that the required ARM features, such as -march=armv8.2a and any specific flags like i8mm, are enabled when supported by the hardware.
Yes, that should work. Ideally it would work the same way as on x86: automatically enable all the features of the local CPU when `GGML_NATIVE` is enabled, and otherwise allow individual features to be enabled manually.
Thanks for confirming. I’ll work on updating the CMake script to ensure that, when GGML_NATIVE is enabled, all supported ARM features of the local CPU are automatically detected and enabled.
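A hypothetical sketch of what that CMake change could look like (`GGML_NATIVE` is from the discussion above; the probe variable name and the exact flag string are illustrative):

```cmake
# Probe whether the compiler accepts the i8mm/dotprod arch flags and,
# when GGML_NATIVE is enabled on an ARM host, add them automatically.
include(CheckCXXCompilerFlag)

if (GGML_NATIVE AND CMAKE_SYSTEM_PROCESSOR MATCHES "arm64|aarch64")
    check_cxx_compiler_flag("-march=armv8.2a+i8mm+dotprod" COMPILER_SUPPORTS_I8MM_DOTPROD)
    if (COMPILER_SUPPORTS_I8MM_DOTPROD)
        add_compile_options(-march=armv8.2a+i8mm+dotprod)
    endif()
endif()
```

Note that a flag probe like this only verifies that the compiler accepts the flags, not that the local CPU implements the features, so a runtime query (e.g. sysctl on macOS) would still be needed for true `GGML_NATIVE` behavior.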
What happened?
Prior to PR #9921 / version 4081, the -ngl 0 Q4_0 llama performance was significantly (more than 10x) higher than afterwards. (hardware: Apple MacBook Air M2, 10-core GPU, 24 GB RAM)
with ngl 99:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 194.94 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.81 ± 6.53 |

build: ae8de6d5 (4080)
with ngl 99:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | pp512 | 186.02 ± 13.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4 | tg128 | 11.25 ± 3.42 |

build: 1607a5e5 (4081)
The variations except for -ngl 0 / Q4_0 might be due to the MacBook Air's thermals.
Name and Version
Apple clang version 16.0.0 (clang-1600.0.26.4) Target: arm64-apple-darwin24.1.0 Thread model: posix macOS Sequoia 15.1.1
What operating system are you seeing the problem on?
Mac
Relevant log output
No response