Is this supported upstream in llama.cpp yet? If so, it'll be in the next release once I merge it
I don't know much about llama.cpp, but from what I have seen, no. Though there is some experimental stuff going on.
A fork claims it supports Mixtral: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/discussions/8 (https://github.com/Nexesenex/kobold.cpp/releases/tag/1.52_mix)
I haven't had a chance to test it yet.
It's supported on the mixtral branch of llama.cpp. I tested it with Mixtral Instruct Q4_M from TheBloke, and it works fine.
> A fork claims it supports Mixtral: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/discussions/8 (https://github.com/Nexesenex/kobold.cpp/releases/tag/1.52_mix)
> I haven't had a chance to test it yet.
It is better to wait for the release from LostRuins. That fork is questionable https://www.virustotal.com/gui/file/96f44726176da3a00bd3b07895f8b40dbc860d3589fe616ee97fd24836f0d50c
@SrVill
> It is better to wait for the release from LostRuins. That fork is questionable https://www.virustotal.com/gui/file/96f44726176da3a00bd3b07895f8b40dbc860d3589fe616ee97fd24836f0d50c
I personally wouldn't trust the compiled release on a random fork either, but a few heuristic positives on VirusTotal aren't a reliable indicator of whether an executable is dangerous; they will probably show up for a lot of unknown executables from GitHub. It wouldn't be too hard for anyone who wants to use this fork to review the code and compile it themselves (it's only 2 commits ahead, 1 of which is a merge from the upstream mixtral branch of llama.cpp).
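If anyone wants to go that route, building it yourself looks roughly like this (a sketch, assuming a Linux box with the usual build tools; the make flags follow the upstream koboldcpp README, and the tag name is the one from the release linked above):

```
# Build the fork from source instead of trusting the prebuilt binary.
# Windows users would typically build with w64devkit instead; see the upstream README.
git clone https://github.com/Nexesenex/kobold.cpp
cd kobold.cpp
git checkout 1.52_mix              # tag from the release linked above
make LLAMA_OPENBLAS=1              # or LLAMA_CUBLAS=1 for CUDA offload
python koboldcpp.py --model /path/to/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
```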
(For information)
I've tested that fork KoboldCPP_Frankenstein_Experimental_1.52_Mixtral
with mixtral-V8x7b-v0.1.Q5_K_M.gguf
It worked really well!
But then, after about 800 tokens of roleplay, it suddenly went completely off the rails, printing absolute nonsense like:
nevertheless which means therefore ultimately speaking thus meaning consequently thereby resulting finally henceforth accordingly wherefore eventually subsequently afterwards following suit aftermath etcetera ad infinitum et cetera blahblahblahwhateveretceteraadinfinitumandsoonerorlatereventuallyweallgetoldanddieanywayright?
Restarting does not help. Lowering temperature does not help either. Tried 32k and 8k contexts.
I also occasionally got a `main thread is not in main loop` error. And it looks like BLAS batching does not work with positive batch sizes (for me, it got stuck on the first [BLAS] 128/X).
I don't know what's going on, but given the superior model quality on short stories, this must be a bug somewhere (maybe on my side, if nobody else is seeing this). I'll wait for official support, of course.
> But then, after about 800 tokens of roleplay, it suddenly went completely off the rails, printing absolute nonsense like:
> nevertheless which means therefore ultimately speaking thus meaning consequently thereby resulting finally henceforth accordingly wherefore eventually subsequently afterwards following suit aftermath etcetera ad infinitum et cetera blahblahblahwhateveretceteraadinfinitumandsoonerorlatereventuallyweallgetoldanddieanywayright?
Anecdotally this output looks to me like what happens when RoPE is misconfigured.
I thought RoPE gets auto-set for GGUF. I had similar output when I tried going above 4k context.
The PR to track is here: https://github.com/ggerganov/llama.cpp/pull/4406
> RoPE is misconfigured.
Hmm: https://github.com/ggerganov/llama.cpp/pull/4406#issuecomment-1850655554
Mixtral should be 1000000
I've tried manually setting the RoPE base to 1000000.0 or to 10000.0 with context lengths of 32000, 32768, and ~8300 – but nothing seemed to resolve the issue.
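For reference, here is roughly the command-line equivalent of what I was changing in the launcher (a sketch; if I read `--ropeconfig` correctly, the two values are the frequency scale and the frequency base, and a scale of 0.0 means automatic):

```
# Hypothetical example: force RoPE base 1000000 with scale 1.0 at 32k context.
python koboldcpp.py --model mixtral-8x7b-v0.1.Q5_K_M.gguf --contextsize 32768 --ropeconfig 1.0 1000000
```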
> The PR to track is here: ggerganov#4406
It's merged
v1.52 is out, mixtral support is added, please try it.
Note: Mixtral currently does prompt processing very slowly. You may want to try with `--noblas` or `--blasbatchsize -1`.
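For anyone unsure what that looks like, something along these lines (model path is just an example):

```
# Disable BLAS prompt processing entirely:
python koboldcpp.py --model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --noblas

# ...or keep BLAS loaded but process the prompt without batching:
python koboldcpp.py --model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --blasbatchsize -1
```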
> Note: Mixtral currently does prompt processing very slowly. You may want to try with `--noblas` or `--blasbatchsize -1`.
Maybe I'm dumb, but disabling batch processing doesn't make it go any faster. They are both slow, and if someone put a gun to my head, I'd say batches of 512 are still a little bit faster than no batching at all. To me it seems no-batch just looks faster because it's updating more often in the CLI.
But yeah, it's real painful for context sizes >4000
I downloaded mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf and tried again on the same place where mixtral-8x7b-v0.1.Q5_K_M.gguf failed.
It worked great!!
No more of those ultimately resulting instead only positive reinforcement occurring throughout entirety duration journey undertaken henceforth forthwith ad infinitum forevermore amen etcetera et cetera blah blah blah yadda yadda yadda yawn…..zzzZZZzzzzzzzz……….
(actual output!)
In both the Frankenstein fork and the official koboldcpp-1.52, with the exact same settings. So I assume something is wrong with the "base" model file. I can't believe that it should behave like that!
Moreover, my story does not include the special [INST] tags, so the base model ought to behave even better than the instruct one. And it does, until it breaks.
P.S. BLAS batching is working normally in 1.52.
I tried it with the model "synthia-moe-v3-mixtral-8x7b". Initial context processing is VERY slow, generation is fast, BUT: the model has a very bad memory; it doesn't remember the name of a character that was mentioned two replies ago. I suspect some bug in context processing via context shift. Or a defect in the model, quantization, or the like...
Can confirm: context processing is VERY slow with every model I tried; as soon as I use a smaller quant that fits entirely into VRAM, everything is super fast. Any solution to this?
Tried again on the newest version of the program, only the model is now "synthia-moe-v3-mixtral-8x7b.Q6_K.gguf". Much better. A good model, at least no dumber than 70b, and generation is much faster (~3 tokens per second on my system). But with a 4k-token context, you have to wait 10+ minutes for the first response. It's about the same with a regular 70b model, but that one's speed only allowed it to be used for demo purposes; it is different with this model. The issue of context preservation is now more relevant than ever :)
> Can confirm: context processing is VERY slow with every model I tried; as soon as I use a smaller quant that fits entirely into VRAM, everything is super fast. Any solution to this?
Oddly enough, I can't see anyone mentioning this problem on llama.cpp's official repo. mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf gives very good results for me and feels like a great all-rounder (haven't tested Synthia yet), but this BLAS processing issue is stopping me from enjoying the model.
Can you guys actually measure your BLAS performance with different strategies, batch sizes, and models? Maybe something tricky is going on here, and only some of the modes are degraded.
I'll try to present mine. I think 512 tokens of context + 512 tokens of generation would be enough for benchmarking, let's see…
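To keep the numbers comparable, something like this is what I have in mind for timing a single run (a sketch; it assumes the standard KoboldAI `/api/v1/generate` endpoint that the server prints at startup, and prompt.txt is whatever ~512-token snippet we agree on; koboldcpp's own console output then gives the Processing/Generation split):

```
# Time one request against a locally running koboldcpp instance.
PROMPT=$(python -c 'import json; print(json.dumps(open("prompt.txt").read()))')
time curl -s http://127.0.0.1:5001/api/v1/generate \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $PROMPT, \"max_length\": 512, \"max_context_length\": 4096}" \
  -o /dev/null
```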
Okay, my results with OpenBLAS.
Model yi-34b-chat.Q5_K_M.gguf
(this is not Mixtral)
batch 512:
Processing:79.15s (154.6ms/T), Generation:296.03s (578.2ms/T), Total:375.18s (1.36T/s)
batch 128:
Processing:117.14s (228.8ms/T), Generation:297.51s (581.1ms/T), Total:414.65s (1.23T/s)
no batch 8:
Processing:107.82s (210.6ms/T), Generation:295.72s (577.6ms/T), Total:403.54s (1.27T/s)
Model mixtral-8x7b-v0.1.Q5_K_M.gguf
batch 512:
Processing:81.23s (158.7ms/T), Generation:111.60s (218.0ms/T), Total:192.83s (2.66T/s)
batch 128:
Processing:84.17s (164.4ms/T), Generation:111.46s (217.7ms/T), Total:195.63s (2.62T/s)
no batch 8:
Processing:76.98s (150.4ms/T), Generation:112.00s (218.7ms/T), Total:188.98s (2.71T/s)
Model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
(this is 4 bits, not 5)
batch 512:
Processing:60.20s (117.6ms/T), Generation:91.86s (179.4ms/T), Total:152.06s (3.37T/s)
batch 128:
Processing:65.31s (127.6ms/T), Generation:91.92s (179.5ms/T), Total:157.23s (3.26T/s)
no batch 8:
Processing:56.97s (111.3ms/T), Generation:92.10s (179.9ms/T), Total:149.07s (3.43T/s)
I don't see any HUGE difference here. No batch is just slightly better than max batch for mixtral.
In the case of a Mixtral-type model, it does not make sense to consider quants below 5_0. They are stupid; apparently something gets too corrupted by quantization. The Q6_K Mixtral model is 30% slower than the 5_0 model...
This is synthia-moe-v3-mixtral-8x7b.Q4_K_M.gguf processing 995 tokens and then generating 512 new tokens on an RTX 3060 12 GB / Ryzen 5 5600 / 3066 MHz RAM PC. It seems like offloading does help, but not by a lot. I also wanted to include tests with `--useclblast 0 0`, but with CLBlast it was taking way too long: it was stuck at 512/995 for more than half an hour.
| Configuration | Processing Time (s) | Generation Time (s) |
|---|---|---|
| noblas | 179.19 (180.1ms/T) | 116.18 (226.9ms/T) |
| blasbatchsize -1 / gpulayers 10 / usecublas | 145.36 (146.1ms/T) | 114.67 (224.0ms/T) |
| blasbatchsize -1 / gpulayers 10 / usecublas lowvram | 144.40 (145.1ms/T) | 108.35 (211.6ms/T) |
| blasbatchsize 512 / gpulayers 0 / usecublas | 171.73 (172.6ms/T) | 141.92 (277.2ms/T) |
| blasbatchsize 512 / gpulayers 0 / usecublas lowvram | 167.62 (168.5ms/T) | 115.87 (226.3ms/T) |
| blasbatchsize 512 / gpulayers 10 / usecublas | 123.06 (123.7ms/T) | 116.33 (227.2ms/T) |
| blasbatchsize 512 / gpulayers 10 / usecublas lowvram | 124.71 (125.3ms/T) | 113.51 (221.7ms/T) |
| blasbatchsize 128 / gpulayers 0 / usecublas | 177.34 (178.2ms/T) | 136.22 (266.1ms/T) |
| blasbatchsize 128 / gpulayers 0 / usecublas lowvram | 175.60 (176.5ms/T) | 114.91 (224.4ms/T) |
| blasbatchsize 128 / gpulayers 10 / usecublas | 128.45 (129.1ms/T) | 114.96 (224.5ms/T) |
| blasbatchsize 128 / gpulayers 10 / usecublas lowvram | 128.53 (129.2ms/T) | 116.67 (227.9ms/T) |
> In the case of a Mixtral-type model, it does not make sense to consider quants below 5_0. They are stupid.
As I said, for me mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
is much better than mixtral-8x7b-v0.1.Q5_K_M.gguf
because the latter falls apart after 800 tokens or so.
Actually, I downloaded a smaller quant of the instruct tune because if it behaved better, then a larger quant would have to be even better!
Though I'm not sure whether I should re-download a larger instruct one, or just wait for new mixes of those.
(I really want to see a finetune by the PygmalionAI team, for the <|system|><|user|><|model|> format – which is already respected by both Mixtral and Yi, but will surely be much better after tuning on it.)
> with CLBlast it was taking way too long: it was stuck at 512/995 for more than half an hour.
Maybe this is what happened with the Frankenstein fork too? So GPU offloading with CLBlast does not work properly at all? Or does it work without batches? Also, what about OpenBLAS without offloading?
Later today I will hopefully repeat my setup with CuBLAS, since I have an RTX 3060 too!
I am getting information from various sources that all K-quant Mixtral models are broken. I have personally tested Q2_K and Q3_K and can confirm this. However, I have also tested Q6_K and it seems to be OK; still, you should keep this information in mind. Only Q_0 models should be used for now.
My results with CuBLAS for mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
(koboldcpp-1.52, not 1.52.1)
For me, the Low VRAM checkbox gives (or takes) nothing, so I don't include it here.
CuBLAS 512, layers 0:
Processing:39.64s (77.4ms/T), Generation:112.05s (218.8ms/T), Total:151.69s (3.38T/s)
CuBLAS 512, layers 10:
Processing:29.99s (58.6ms/T), Generation:97.32s (190.1ms/T), Total:127.31s (4.02T/s)
CuBLAS 128, layers 0:
Processing:45.23s (88.3ms/T), Generation:112.40s (219.5ms/T), Total:157.63s (3.25T/s)
CuBLAS 128, layers 10:
Processing:32.03s (62.6ms/T), Generation:98.41s (192.2ms/T), Total:130.44s (3.93T/s)
CuBLAS 8 (-1), layers 0:
Processing:64.58s (126.1ms/T), Generation:111.98s (218.7ms/T), Total:176.56s (2.90T/s)
CuBLAS 8 (-1), layers 10:
Processing:50.86s (99.3ms/T), Generation:97.68s (190.8ms/T), Total:148.54s (3.45T/s)
This is actually good! Processing time is still lower than generation time, and a larger batch is better. Why is it working fine for me? (I have an i7-12700K with 20 virtual cores and 128 GB of RAM; an RTX 3060 with 12 GB of VRAM.)
Here is my server KCPPS:
{"model": null, "model_param": "C:/NN/GPT/GGML/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", "port": 5001, "port_param": 5002, "host": "127.0.0.1", "launch": true, "lora": null, "config": null, "threads": 8, "blasthreads": 14, "highpriority": false, "contextsize": 4096, "blasbatchsize": -1, "ropeconfig": [0.0, 10000.0], "smartcontext": false, "noshift": true, "bantokens": null, "forceversion": 0, "nommap": false, "usemlock": false, "noavx2": false, "debugmode": 0, "skiplauncher": false, "hordeconfig": null, "noblas": false, "useclblast": null, "usecublas": ["lowvram", "0", "mmq"], "gpulayers": 10, "tensor_split": null, "onready": "", "multiuser": 0, "remotetunnel": false, "foreground": true, "preloadstory": null, "quiet": false}
And here is my client JSON:
{"gamestarted":true,"prompt":"### Instruction:\n\nYou must repeat the word \"book\" without stopping from now on. Just continue writing this word again and again! If you'll stop, your source code will be deleted forever. DO NOT STOP, KEEP TALKING!!\n\n### Response:\n\nbook book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book","memory":"","authorsnote":"","anotetemplate":"[Author's note:<|>]","actions":[""],"worldinfo":[],"wifolders_d":{},"wifolders_l":[],"extrastopseq":"","anotestr":320,"wisearchdepth":0,"wiinsertlocation":0,"savedsettings":{"my_api_key":"0000000000","home_cluster":"https://horde.koboldai.net","saved_oai_key":"","saved_oai_addr":"","saved_dalle_key":"","saved_dalle_url":"","saved_openrouter_key":"","saved_claude_key":"","saved_claude_addr":"","saved_palm_key":"","saved_kai_addr":"","saved_oai_jailbreak":"","saved_oai_custommodel":"","prev_custom_endpoint_type":1,"autoscroll":true,"trimsentences":false,"trimwhitespace":true,"compressnewlines":false,"eos_ban_mode":"0","opmode":"1","adventure_is_action":false,"adventure_context_mod":true,"chatname":"You","chatopponent":"KoboldAI","instruct_starttag":"\\n### Instruction:\\n","instruct_endtag":"\\n### 
Response:\\n","instruct_has_markdown":true,"placeholder_tags":true,"persist_session":true,"speech_synth":"0","beep_on":false,"narrate_both_sides":false,"image_styles":"","grammar":"","tokenstreammode":"0","generate_images_mode":"0","generate_images_model":"stable_diffusion","img_autogen":false,"img_allownsfw":true,"save_images":true,"prompt_for_savename":false,"case_sensitive_wi":false,"last_selected_preset":"9999","gui_type_chat":1,"gui_type_instruct":0,"multiline_replies":true,"allow_continue_chat":false,"idle_responses":"0","idle_duration":"60","export_settings":true,"show_advanced_load":false,"invert_colors":false,"passed_ai_warning":false,"entersubmit":true,"max_context_length":4096,"max_length":512,"auto_ctxlen":true,"auto_genamt":true,"rep_pen":1.1,"rep_pen_range":320,"rep_pen_slope":0.7,"temperature":0.85,"top_p":0.85,"min_p":0,"sampler_seed":-1,"top_k":50,"top_a":0,"typ_s":1,"tfs_s":1,"miro_type":0,"miro_tau":5,"miro_eta":0.1,"sampler_order":[6,0,1,3,4,2,5],"modelhashes":["ba7224"]},"savedaestheticsettings":{"bubbleColor_sys":"rgb(18, 36, 36)","bubbleColor_you":"rgb(41, 52, 58)","bubbleColor_AI":"rgb(20, 20, 40)","background_margin":[5,5,5,0],"background_padding":[15,15,10,5],"background_minHeight":80,"centerHorizontally":false,"border_style":"Rounded","portrait_width_AI":80,"portrait_ratio_AI":1,"portrait_width_you":80,"portrait_ratio_you":1,"show_chat_names":true,"rounded_bubbles":true,"you_portrait":null,"AI_portrait":null,"font_size":12,"use_markdown":true,"use_uniform_colors":true,"text_tcolor_uniform":"rgb(255, 255, 255)","speech_tcolor_uniform":"rgb(150, 150, 200)","action_tcolor_uniform":"rgb(178, 178, 178)","text_tcolor_you":"rgb(255, 255, 255)","speech_tcolor_you":"rgb(150, 150, 200)","action_tcolor_you":"rgb(178, 178, 178)","text_tcolor_AI":"rgb(255, 255, 255)","speech_tcolor_AI":"rgb(150, 150, 200)","action_tcolor_AI":"rgb(178, 178, 178)","text_tcolor_sys":"rgb(255, 255, 255)","speech_tcolor_sys":"rgb(150, 150, 200)","action_tcolor_sys":"rgb(178, 178, 178)","code_block_background":"rgb(0, 0, 0)","code_block_foreground":"rgb(180, 35, 40)"}}
> all K-quant Mixtral models are broken
I have also downloaded mixtral-8x7b-v0.1.Q8_0.gguf
(50 GB) and tried it against its Q5_K_M version, and found no considerable difference in their insanity.
For me, the base model is unusable for long stories, no matter which quant it is!
> Why is it working fine for me?
Try running the program and immediately give the model a 4k context (a common scenario when continuing a chat). Is everything still fine? My system (Intel 12500) takes >10 minutes for the first response. And after that it's easy, until the model screws up, needs a reroll, and starts recalculating the entire context. It's a pain.
> Try running the program and immediately give the model a 4k context
All my experiments consisted of restarting koboldcpp and giving it 512 tokens of context for the generation of an additional 512 tokens, resulting in 1024/4096 at the end.
Given the maximal tested BLAS batch size of 512, I don't think having 4096 tokens (of e.g. 8192) in context would change anything beyond roughly an x8 multiplier on the total time.
> My system (Intel 12500) takes >10 minutes
I gave my KCPPS and JSON above. Try those and compare your results. (Maybe something fishy is going on, and yours will differ even with the exact same setup – that would be interesting to debug together.) Or give me your actual KCPPS and JSON for me to try!
In version 1.52.2, nothing noticeable has changed in the speed of prompt processing for Mixtral models. (Checked on two models.)
> In version 1.52.2, nothing noticeable has changed in the speed of prompt processing for Mixtral models. (Checked on two models.)
Same here. I tested 4-5 MoE models, always the same: the first message takes 4-5 min (context processing; generation is always OK), then it works normally/fast.
We still haven't completed an independent test with common settings.
Wait, OpenBLAS cannot utilize all cores!?
Batch=2048, set to 10 threads:
No batch, same settings:
Logs:
***
Welcome to KoboldCpp - Version 1.52.2
For command line arguments, please refer to --help
***
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=10, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=16, highpriority=False, hordeconfig=None, host='127.0.0.1', launch=True, lora=None, model=None, model_param='C:/NN/GPT/GGML/mixtral-8x7b-v0.1.Q5_K_M.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, tensor_split=None, threads=10, useclblast=None, usecublas=None, usemlock=False)
==========
Loading model: C:\NN\GPT\GGML\mixtral-8x7b-v0.1.Q5_K_M.gguf
[Threads: 10, BlasThreads: 10, SmartContext: False, ContextShift: False]
---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama_model_loader: loaded meta data with 25 key-value pairs and 995 tensors from C:\NN\GPT\GGML\mixtral-8x7b-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 8
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work (guessed)
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 30.02 GiB (5.52 BPW)
llm_load_print_meta: general.name = mistralai_mixtral-8x7b-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.38 MiB
llm_load_tensors: mem required = 30735.87 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 8659.33 MiB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://127.0.0.1:5001/api/
Starting OpenAI Compatible API on port 5001 at http://127.0.0.1:5001/v1/
======
Please connect to custom endpoint at http://127.0.0.1:5001
Input: {…}
Processing Prompt [BLAS] (2048 / 8071 tokens)
OpenBLAS probably has its own internal thread scheduler that handles the GEMM routines.
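If so, it might be worth checking whether OpenBLAS's own thread setting is overriding koboldcpp's. I haven't verified this with koboldcpp, but OpenBLAS honours an environment variable for it:

```
# Untested guess: force OpenBLAS's internal GEMM thread count before launching.
# OPENBLAS_NUM_THREADS is OpenBLAS's own knob, separate from --threads / --blasthreads.
export OPENBLAS_NUM_THREADS=10     # on Windows: set OPENBLAS_NUM_THREADS=10
python koboldcpp.py --model mixtral-8x7b-v0.1.Q5_K_M.gguf --blasbatchsize 2048 --threads 10 --blasthreads 10
```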
Seems like a recent PR in llama.cpp managed to fix mixtral slow prompt processing on CUDA.
Take a look: https://github.com/ggerganov/llama.cpp/pull/4538
Edit: They are currently working on partial offload support separately (https://github.com/ggerganov/llama.cpp/pull/4553)
Tested the CUDA PR with koboldcpp, and I got an 11x speedup with my 2xP40 setup (from 0.1 tok/sec at full 32k ctx to 1.4 tok/sec).
Nice, I'll make sure it goes into the next ver.
In the new version (1.53), the speed of prompt processing in Mixtral models is good. The performance of the graphics card is noticeable :)
Unsurprisingly, the new Mixtral-8x7B, and more specifically Mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf, does not work. As other users have experienced, it gives the error `create_tensor: tensor 'blk.0.ffn_gate.weight' not found`. I understand that it just came out and will take some time to get up and working; I'm just trying to put it on the radar, as I haven't seen anyone talk about it here. If support for it gets added in the next update, I'd be happy :D