PeterStark-FJ opened 3 weeks ago
Some related things: #7983 discusses using SIMD Everywhere to simplify using instructions (but note that it does not offer runtime dispatching itself); things like cpu_features can be used to do runtime feature detection without having to write all the hairy assembly/intrinsic invocation yourself; and both gcc and clang of course offer multi-versioning through __attribute__((target)) and target_clones, which is great when it works, but probably not for this project as it quickly runs into OS/architecture limitations.
So, if I understood you correctly, the output of the "system info" line is information about how the binary was compiled, not about the machine it is actually running on. I can understand that it might be very tricky and complex to support a variety of processor features, as it is not just a single swap of a shared library at runtime. Did you consider utilizing Intel's oneAPI?
"Very tricky" is a bit of an overstatement; all you really need is code for detecting the appropriate features at runtime, plus dispatcher code that invokes the function version compiled for those features, which just means going through a function pointer. If you have no access to multi-versioning (ifuncs) to let the compiler do this transparently, this is more tedious than tricky. With C you're looking at a bunch of ugly macro code and/or copy-pasting; with C++ you can leverage templates to cut down on that. But as the original conversation mentions, while conceptually not that difficult, it is a substantial refactoring effort. The alternative solution discussed there, building GGML multiple times with different options and having llama.cpp load the correct version at runtime, is also viable and requires no code changes in GGML (but it would require changes to llama.cpp, including OS-specific loading logic, and it is a bit crude in terms of compile times and binary sizes).
Did you consider utilizing Intel's oneAPI?
That's already being done (more or less) in the form of SYCL support. But at the moment SYCL is just one backend of many, since it doesn't outperform established alternatives for hardware that is also supported through other libraries. Using oneAPI for everything would not be viable and/or introduce substantial performance regressions (with the existing implementations). It's basically solving one problem at the cost of introducing even bigger ones.
Disclaimer: I have no personal experience using oneAPI/SYCL, the above is what I've distilled from various online discussions and should be taken with a grain of salt.
Thanks for taking your time to answer this. I'll be watching SYCL closely now. :)
What happened?
I'm running ollama, which in turn uses llama.cpp. The server has quad Intel Xeon Sapphire Rapids CPUs. In the debug line for the "system info" I get:
This surprised me, as the SPR processors have the following flags (from /proc/cpuinfo):
Which means avx2 etc. are not listed correctly. I then pulled llama.cpp and checked the source. In ggml/src/ggml.c it says:
Which means that it determines the capabilities of the processor at compile time, not at runtime. This is pretty unfortunate: software (like ollama) that is copied over to multiple machines without recompiling the llama.cpp part will always assume the capabilities of the system on which it was compiled, not of the one it is running on.
There was already a discussion "Regarding detection and use of processor feature sets #535" on that topic over a year ago.
Name and Version
Source from today b3617
What operating system are you seeing the problem on?
No response
Relevant log output