LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[Feature request] Simple option to set process affinity as a number of cores to use #447

Open aleksusklim opened 9 months ago

aleksusklim commented 9 months ago

I have an Intel Core i7-12700K (on Windows 10): it has 8 main "Performance" cores with hyperthreading and 4 energy-saving "Efficient" cores, giving 16+4 = 20 logical cores in total.

The problem is that if I just run koboldcpp.exe as-is, then after some time Windows moves its background process to the 4 Efficient cores (the last ones, 17 to 20). It moves all threads of the process off the 16 main cores onto just the 4 Efficient cores, no matter how many threads I set. Performance is awful until I bring the console window to the foreground, and then boom – all main cores jump to 100% load with the coolers instantly speeding up!

The solution is to set the "process affinity" for the koboldcpp process (for example in Task Manager), leaving only the first 16 cores for it, or to start the executable from a .bat file with something like `start "koboldcpp" /AFFINITY FFFF koboldcpp.exe`.
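For reference, the hex mask is just the low N bits set (bit i of the mask selects logical CPU i). A minimal sketch of the arithmetic; the function name is mine, not part of koboldcpp:

```python
def affinity_mask(n_cores: int) -> str:
    """Hex affinity mask selecting the first n_cores logical CPUs.

    Bit i of the mask corresponds to logical CPU i, so the first
    16 cores are (1 << 16) - 1 = 0xFFFF.
    """
    return format((1 << n_cores) - 1, "X")

# Generates the workaround command above for any core count:
print(f'start "koboldcpp" /AFFINITY {affinity_mask(16)} koboldcpp.exe')
```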

But now I think other people might have this problem too, and it is very inconvenient to use the command line or Task Manager – especially since you have such a great UI with the ability to load stored configs! You could add an option named something like "max number of cores to use:", which (if not zero) would set the process affinity mask to that many cores, starting from core 0. (I believe the Efficient cores always come at the end, right?)

You could add a tooltip explaining that it is beneficial to put the number of "powerful" logical cores there to increase performance, or to deliberately limit the cores koboldcpp uses so they remain free for other CPU-intensive applications (which plays nicely with the thread-count limit you already have in the GUI).

Do not add full affinity-mask support, because most users would not understand how to set it, while those who can may just as well start from the command line with any desired affinity.

LostRuins commented 9 months ago

Does setting the process to high priority affect this? because that is already an option

aleksusklim commented 9 months ago

I can confirm that this all-threads-on-Efficient-cores behavior occurs for me even with the high-priority checkbox on, on version 1.44.1 at default settings.

Here are Task Manager screenshots made at normal priority: before / after. You can see the process used only the last 4 cores. Then I switched to the koboldcpp console window – and the performant physical cores started working.

It is hard to trigger right away. I had around 10 unsuccessful attempts where everything stayed on the main cores from the start. (For some reason I had more success when using "stream mode", maybe because the browser in the foreground was constantly rendering text.)

I have more luck triggering it by connecting to the target machine over RDP and leaving it on a long generation. Then, when I check why it hasn't finished yet, I almost always see the Efficient cores loaded (and realize I forgot to set the affinity again!)

When I tried ticking "high priority" in the GUI and leaving it: after some time my RDP session was disconnected and I couldn't log back in. I then tried to log in locally on the physical machine – it lagged badly, I saw 0% load on the main cores and constant 100% load on all 4 Efficient cores, and then the system hung completely and I had to reboot. (It looked as if Windows decided to put ALL processes, including system ones, into the efficient state, but because koboldcpp was at high priority, nothing else could run there in parallel anymore…)

LostRuins commented 9 months ago

@aleksusklim can i ask, what are your launch parameters? how many --threads did you start the program with?

aleksusklim commented 9 months ago

what are your launch parameters?

Correct me if I'm wrong, but I think it is impossible to use command-line options and still have the GUI show up and take effect?

For example, if I want to use an option that doesn't exist in the GUI, would I have to specify, along with it, every single option I previously set via the GUI? Because of that, I don't launch from the command line anymore: I set up my preferred settings in the GUI for each model I want to use (including the path) and then save the config. So all I have to do is load the config I want and tweak it for the current launch (for example, disabling the GPU if my VRAM is busy with other tasks right now).

Back to your question: in my tests above, I left everything at defaults (except for model and streaming). I think it defaulted to 9 threads for some reason.

As for my usual setup, I set 16 threads and FFFF affinity, i.e. 1 thread per performant logical core, dropping the 4 Efficient cores for other applications. When I have a GPU-intensive background task, I set threads to 15 and affinity to the first 15 cores (via checkboxes in Task Manager), so that one performant logical core is free from koboldcpp.

Personally, I don't see the point of "1 thread per physical core" (instead of per logical core, which would mean using 8 threads instead of 16). Still, the offloading to Efficient cores happens regardless of how many threads koboldcpp uses. (Maybe the thread count changes the probability of offloading, but it does not eliminate it.)

I googled, and this offloading is a common problem. I saw two recommendations:

1. Put Windows into "maximum performance" in the Power settings of the Control Panel. I did this, but for some reason it changed nothing for me.
2. Disable the Efficient cores completely in the BIOS. I don't want to do that yet; I like that when there is nothing to do, all my processes use low-frequency processing – very quiet and power-friendly.

LostRuins commented 9 months ago

I have added what I hope will be a decent solution to this issue. You can now specify a launcher parameter --foreground that will bring the terminal console onto the foreground every time a new generation is started. This should hopefully prevent Windows from using E-cores instead of P-cores. Please try it out!

aleksusklim commented 9 months ago

I tried it.

The behavior is this:

I find these things VERY confusing for an end user – they almost look like bugs. Also, a user could accidentally click inside the popped-up console and "pause" it via the text-selection feature, which would lead to users filing issues like "koboldcpp randomly freezes" (I've seen a lot of those in repos where the main program runs as a console process that shows its interface in the browser).

The only viable use case is headless sessions, where nobody works on the desktop of the user who owns the koboldcpp process. I haven't yet tested whether the forced foreground eliminates E-core offloading with a locked workstation, e.g.: I connect over RDP, open koboldcpp, somehow share the URL with another machine, and close RDP – the host is then locked but operational. (To test this I would need to measure actual speed, since I obviously won't be able to see the Task Manager window.)

Instead of convincing you that simply setting the affinity is better because it avoids these issues, I decided to measure the speed difference between "full affinity" and "P-cores-only affinity". I mean, even in the foreground I see my Efficient cores loaded during generation. But I have a feeling that if the process were forcefully restricted to the performant cores only, the generation speed would be slightly higher (because E-cores are really slow compared to P-cores!)

I started my tests and… it crashed? I reproduced the crash several times; it looks quite consistent.

Here are the full logs from the console:

```
***
Welcome to KoboldCpp - Version 1.45.2
For command line arguments, please refer to --help
***
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
==========
Overriding thread count, using 12 threads instead.
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=14, config=None, contextsize=8192, debugmode=False, forceversion=0, foreground=True, gpulayers=41, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/NN/GPT/GGML/mythalion-13b.Q5_K_M.gguf', multiuser=False, noavx2=False, noblas=False, nommap=False, onready='', port=5001, port_param=5001, psutil_set_threads=True, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=True, tensor_split=None, threads=12, unbantokens=True, useclblast=None, usecublas=None, usemirostat=None, usemlock=False)
==========
Loading model: C:\NN\GPT\GGML\mythalion-13b.Q5_K_M.gguf
[Threads: 12, BlasThreads: 14, SmartContext: False]

---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling (scale:1.000, base:32000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from C:\NN\GPT\GGML\mythalion-13b.Q5_K_M.gguf (version GGUF V2 (latest))
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = unknown, may not work
llm_load_print_meta: model params   = 13.02 B
llm_load_print_meta: model size     = 8.60 GiB (5.67 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 8801.75 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 6400.00 MB
llama_new_context_with_model: compute buffer total size = 2749.89 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
WARNING: --unbantokens is DEPRECATED and will be removed soon! EOS unbans should now be set via the generate API.
WARNING: --stream is DEPRECATED and will be removed soon! This was a Kobold Lite only parameter, which is now a proper setting toggle inside Lite.
WARNING: --psutil_set_threads is DEPRECATED and will be removed soon! This parameter was generally unhelpful and unnecessary, as the defaults were usually sufficient
Please connect to custom endpoint at http://localhost:5001
Force redirect to streaming mode, as --stream is set.

Input: {"n": 1, "max_context_length": 8192, "max_length": 128, "rep_pen": 1.1, "temperature": 0.85, "top_p": 0.85, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "genkey": "KCPP7213", "prompt": "<|system|>Enter RP mode. Pretend to be Albert Einstein at his prime of life. You shall reply to the user while staying in character, and generate long responses.\n<|user|>Where were you born?\n<|model|>", "quiet": true, "stop_sequence": ["<|user|>", "<|model|>", "\n", "<", "|"], "use_default_badwordsids": false}

Processing Prompt [BLAS] (59 / 59 tokens)
Generating (25 / 128 tokens)
(Stop sequence triggered: < >)
Time Taken - Processing:8.9s (151ms/T), Generation:5.2s (208ms/T), Total:14.1s (1.8T/s)
Output: I was born on March 14, 1879, in Ulm, Württemberg, Germany.

Exception ignored in:
Traceback (most recent call last):
  File "tkinter\__init__.py", line 363, in __del__
RuntimeError: main thread is not in main loop
[the "Exception ignored" traceback above repeats 5 times in total]
```

I don't know what happened or why. If you need more info – my config or my test history – I will provide it. I can test this on any other version if needed.

LostRuins commented 9 months ago

Yeah, the idea of --foreground was mainly aimed at headless operation; since you mentioned using an RDP session and connecting remotely, it would ensure the application always receives priority during generation.

The main reason why I don't want to add /AFFINITY directly is the mask required will be different for each CPU setup. You cannot just set /AFFINITY FFFF, that would simply enable the process to use the first 16 cores. You'd need to figure out which ones to enable and which to disable and that is different for every PC. There are cases where the system has only 2 P-Cores and another 8 E-Cores for example.

aleksusklim commented 9 months ago

That's why I explicitly stated that you don't need to provide full affinity support!

Just "Use first N cores", with mathematically crafted mask so that only those cores are selected:

| Cores | Mask |
|---|---|
| 0 | -1 (all / disabled / do nothing) |
| 1 | 1 |
| 2 | 3 |
| 3 | 7 |
| 4 | F |
| 5 | 1F |
| 6 | 3F |
| 7 | 7F |
| 8 | FF |
| 9 | 1FF |
| 10 | 3FF |
| 11 | 7FF |
| 12 | FFF |
| … | … |

(I used online affinity mask calculator, like https://bitsum.com/tools/cpu-affinity-calculator/)
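In code, "use the first N cores" is a one-liner. Here is a hedged sketch of how the option could apply it; `psutil` already appears in koboldcpp's flags (`psutil_set_threads`), and `psutil.Process.cpu_affinity()` works on both Windows and Linux, but the helper names are hypothetical:

```python
def first_n_cores(n: int, total: int) -> list[int]:
    """Core list for a 'max number of cores to use' option.

    n == 0 means 'all / disabled / do nothing', per the table above.
    """
    if n == 0:
        return list(range(total))       # leave affinity effectively untouched
    return list(range(min(n, total)))   # cores 0 .. n-1

def apply_first_n_cores(n: int) -> None:
    # Hypothetical glue: restrict the current process to the first n cores.
    import psutil  # already a koboldcpp dependency
    total = psutil.cpu_count(logical=True)
    psutil.Process().cpu_affinity(first_n_cores(n, total))
```

Clamping to the actual core count means a saved config from a 20-core machine would still load safely on a smaller one.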

I think this would suffice for any processor with E-cores, provided the user knows how many performant logical cores they have. Why do you think a user would ever need a "mask with holes"? To run "only on even-numbered logical cores"? What would be the reason – to use each hyperthreaded physical core at only half capacity? Are there any CPUs where Efficient cores come before (or are interleaved with) the performant ones?

I think you can give a rule of thumb: "limit the used cores to double your thread count" (so that there are at least twice as many logical processors, making sure all physical ones are covered). Since you already suggest setting the thread count to half of all cores, this would mean "use everything". BUT if the user knows about performant cores, or specifically wants to free one physical core (e.g. out of 16), they could use "cores/2 - 1" (= 7 of 16) for the thread count and "(cores/2 - 1) * 2" (= 14 of 16) for the core count.

Just setting fewer threads is not enough to free a core completely: Windows still spreads the threads across all cores (except when it offloads to the Efficient cores while idle).
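The arithmetic in the rule of thumb above can be written down as a tiny helper (the names are mine, for illustration only):

```python
def free_one_physical_core(logical_cores: int) -> tuple[int, int]:
    """Thread count and 'first N cores' value that leave one whole
    physical (hyperthreaded) core free, per the rule of thumb above.

    With 16 performant logical cores: 7 threads, affinity over 14 cores.
    """
    threads = logical_cores // 2 - 1   # cores/2 - 1
    return threads, threads * 2        # (cores/2 - 1) * 2

print(free_one_physical_core(16))  # -> (7, 14)
```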

LostRuins commented 9 months ago

That still won't work correctly on the system I described previously (2 P-Cores and 8 E-Cores), in which case you do want to use more than just the 2 P-Cores only.

I think the actual problem you may be encountering is your OS CPU scheduler, which is too aggressive at throttling. Are you running on some sort of power-saving or energy-saving scheme? Because most of the advice I've come across is to allow the OS to handle this kind of thing.

In either case, advanced users are, like you mentioned, able to use /AFFINITY when launching the executable on their own systems. I would like to get some feedback from other people on how many are facing this issue, and what they think of this approach.

aleksusklim commented 9 months ago

the system I described previously (2 P-Cores and 8 E-Cores), in which case you do want to use more than just the 2 P-Cores only.

So what's the problem with the user setting "2" there and using only the P-cores available?

aleksusklim commented 9 months ago

I did some further testing on affinity in the offloaded state:

My conclusion is that the only solution is to prevent the process from touching any E-core. My approach of "tell me how many of the first cores you want to use" will work UNLESS there are processors where E-cores are interleaved with P-cores, or where E-cores come first. Are there?

aleksusklim commented 9 months ago

in which case you do want to use more than just the 2 P-Cores only.

It's worth checking whether 16+4 cores work better than 16+0 cores, in different modes. What about the crash? It didn't happen today when I used the GPU; my previous test that crashed was CPU-only.

LostRuins commented 9 months ago

I have not encountered any crashes recently

aleksusklim commented 8 months ago

I've tested version 1.46.1 and the crash is gone, great. Now I can compare the performance of different thread counts against different affinities…

First: CLBlast with 41/41 layers offloaded to an RTX 3060 (13B model, context set to 8k). All numbers are ms/T from the "Generation" time, the best (lowest) result of 5 attempts for each mode.

| Thread count \ First N cores | 20/20 (all cores) | 16/20 (P-cores) | 8/20 (four physical) | 4/20 (two physical) |
|---|---|---|---|---|
| 4 threads | 87 | 85 | 93 | 96 |
| 8 threads | 89 | 85 | 90 | 94 |
| 16 threads | 95 | 89 | 93 | 97 |
| 20 threads | 96 | 90 | 95 | 98 |

Second: OpenBLAS on CPU.

| Thread count \ First N cores | 20/20 (all cores) | 16/20 (P-cores) | 8/20 (four physical) | 4/20 (two physical) |
|---|---|---|---|---|
| 4 threads | 262 | 214 | 304 | 374 |
| 8 threads | 228 | 201 | 235 | 305 |
| 16 threads | 207 | 205 | 228 | 294 |
| 20 threads | 208 | 213 | 232 | 294 |

Observations:

Since there is no performance gain from using FEWER cores than "all minus Efficient", the only reason a user would want to do this is to "free a core", but I cannot tell when exactly that might be needed.

Clearly, the E-cores have a negative impact on performance even when no full offloading happens! For example, by default koboldcpp suggests using 9 threads. Here is the speed with an absolutely default config (2k context) on a 13B model:

| Thread count \ Process affinity | All 20 cores | Only 16 P-cores |
|---|---|---|
| 9 threads (koboldcpp default for the 12700K) | 230 ms/T | 224 ms/T |
| 8 threads (number of physical performant cores) | 226 ms/T | 200 ms/T |

Instead of giving direct control over the process affinity, you might implement a checkbox like "Do not use Intel E-cores", if you can programmatically detect the P-cores and set the affinity to them only. That would probably be enough!
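A sketch of what that checkbox could do, assuming the per-CPU efficiency class has already been read from the OS (on Windows that would come from `GetLogicalProcessorInformationEx`, where a higher `EfficiencyClass` means a faster core; fetching it is left out here, and all names are hypothetical):

```python
def p_core_affinity(efficiency_classes: list[int]) -> list[int]:
    """Return the logical CPUs belonging to the fastest efficiency class.

    efficiency_classes has one entry per logical CPU; on hybrid Intel
    CPUs the P-cores share the highest class value.
    """
    best = max(efficiency_classes)
    return [i for i, cls in enumerate(efficiency_classes) if cls == best]

# 12700K layout: 16 P-core threads, then 4 E-cores -> first 16 logical CPUs.
print(p_core_affinity([1] * 16 + [0] * 4))
```

The resulting list could then be passed to an affinity call, and it would keep working even on hypothetical layouts where E-cores come first or are interleaved.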