ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Processor features are determined at compile time #9147

Open PeterStark-FJ opened 3 weeks ago

PeterStark-FJ commented 3 weeks ago

What happened?

I'm running ollama, which in turn uses llama.cpp. The server has quad Intel Xeon Sapphire Rapids processors. In the debug line for the "system info" I get:

INFO [main] system info | n_threads=160 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139749882310656" timestamp=1724406025 total_threads=320

This surprised me, as the SPR processors report the following (from /proc/cpuinfo):

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 amx_tile flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling usr_wait_pause

This means avx2 etc. are not reported correctly. I then pulled llama.cpp and checked the source. In ggml/src/ggml.c it says:

int ggml_cpu_has_avx2(void) {
#if defined(__AVX2__)
    return 1;
#else
    return 0;
#endif
}

This shows that the processor capabilities are determined at compile time, not at runtime. That is quite unfortunate: software (like ollama) that is copied to multiple machines without recompiling the llama.cpp part will always assume the capabilities of the system it was compiled on, not of the one it is actually running on.
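For comparison, here is a minimal sketch (not llama.cpp code) of what a runtime check could look like on x86 with GCC/Clang, using the __builtin_cpu_supports builtin; the result then reflects the machine the binary runs on rather than the one it was compiled on:

// Minimal sketch, not from llama.cpp: runtime AVX2 detection via the
// GCC/Clang builtin, in contrast to the compile-time #if defined(__AVX2__).
#include <stdio.h>

static int cpu_has_avx2_at_runtime(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                       /* harmless to call explicitly */
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;  /* unknown platform/compiler: report "not supported" */
#endif
}

int main(void) {
    printf("AVX2 (runtime) = %d\n", cpu_has_avx2_at_runtime());
    return 0;
}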

There was already a discussion on this topic over a year ago: "Regarding detection and use of processor feature sets" #535.

Name and Version

Source from today b3617

What operating system are you seeing the problem on?

No response

Relevant log output

See above.
jeroen-mostert commented 3 weeks ago

Some related things: #7983 discusses using SIMD Everywhere to simplify working with the intrinsics (but note that it does not offer runtime dispatching itself); libraries like cpu_features can do the runtime feature detection without you having to write all the hairy assembly/intrinsic invocations yourself; and both gcc and clang offer function multi-versioning through __attribute__((target)) and target_clones (sketched below), which is great when it works, but probably not for this project as it quickly runs into OS/architecture limitations.
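To illustrate the multi-versioning route, here is a small, hedged example with a hypothetical function (not ggml code); on Linux/x86 with GCC 6+ or a recent Clang, target_clones emits one clone per listed target plus an ifunc resolver that picks the best one when the binary is loaded:

#include <stddef.h>

// One source function, several generated clones; the resolver runs once at
// load time, so callers pay no per-call dispatch overhead beyond the ifunc.
__attribute__((target_clones("avx2", "sse4.2", "default")))
float dot_f32(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];   /* auto-vectorized differently per clone */
    return acc;
}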

PeterStark-FJ commented 2 weeks ago

So, if I understood you correctly, the "system info" output describes how the binary was compiled rather than the actual hardware it is running on. I can understand that supporting a variety of processor features might be very tricky and complex, since it is not just a single swap of a shared library at runtime. Did you consider utilizing Intel's oneAPI?

jeroen-mostert commented 2 weeks ago

"Very tricky" is a bit of an overstatement; all you really need is code for detecting the appropriate features at runtime, and dispatcher code for invoking the function version that is compiled for that feature, which just involves going through a function pointer. If you have no access to multi-versioning (ifuncs) to let the compiler do this transparently this is more tedious than tricky. With C you're looking at a bunch of ugly macro code and/or copy pasting, with C++ you can leverage templates to cut down on that. But as the original conversation mentions, while conceptually not that difficult it is a substantial refactoring effort. The alternative solution discussed there of building GGML multiple times with different options and having llama.cpp load the correct version at runtime is also viable and requires no code changes for GGML (but it would require changes to llama.cpp, including introducing OS-specific loading logic, and it is a bit crude in terms of compile times and binary sizes).

Did you consider utilizing Intel's oneAPI?

That's already being done (more or less) in the form of SYCL support. But at the moment SYCL is just one backend of many, since it doesn't outperform the established alternatives for hardware that is also supported through other libraries. Using oneAPI for everything would not be viable and/or would introduce substantial performance regressions compared with the existing implementations. It's basically solving one problem at the cost of introducing even bigger ones.

Disclaimer: I have no personal experience using oneAPI/SYCL; the above is what I've distilled from various online discussions and should be taken with a grain of salt.

PeterStark-FJ commented 2 weeks ago

Thanks for taking the time to answer this. I'll be watching SYCL closely now. :)