LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Mixtral Support When? #557

Closed cubesstar closed 10 months ago

cubesstar commented 11 months ago

Unsurprisingly, the new Mixtral-8x7B, and more specifically Mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf, does not work. As other users have also experienced, it fails with the error create_tensor: tensor 'blk.0.ffn_gate.weight' not found. I understand that it just came out and will take some time to get up and working; I'm just trying to put it on the radar, as I haven't seen anyone talk about it here. If support for it gets added in the next update I'd be happy :D

LostRuins commented 11 months ago

Is this supported upstream in llama.cpp yet? If so, it'll be in the next release once I merge it

cubesstar commented 11 months ago

I don't know much about llama.cpp, but from what I've seen, no. Though there is some experimental work going on.

aleksusklim commented 11 months ago

A fork claims it supports Mixtral: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/discussions/8 (https://github.com/Nexesenex/kobold.cpp/releases/tag/1.52_mix)

I didn't have a chance to test it yet.

Dirky14 commented 11 months ago

It's supported on the mixtral branch of llama.cpp. I tested it with Mixtral Instruct Q4_M from TheBloke, and it works fine.

SrVill commented 11 months ago

A fork claims it supports Mixtral: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/discussions/8 (https://github.com/Nexesenex/kobold.cpp/releases/tag/1.52_mix)

I didn't have a chance to test it yet.

It is better to wait for the release from LostRuins. That fork is questionable https://www.virustotal.com/gui/file/96f44726176da3a00bd3b07895f8b40dbc860d3589fe616ee97fd24836f0d50c

rosemash commented 11 months ago

@SrVill

It is better to wait for the release from LostRuins. That fork is questionable https://www.virustotal.com/gui/file/96f44726176da3a00bd3b07895f8b40dbc860d3589fe616ee97fd24836f0d50c

I personally wouldn't trust the compiled release on a random fork either, but a few heuristic positives on VirusTotal aren't a reliable indicator of whether an executable is dangerous; they will probably show up for a lot of unknown executables from GitHub. It wouldn't be too hard for anyone who wants to use this fork to review the code and compile it themselves (it's only 2 commits ahead, 1 of which is a merge from the upstream mixtral branch of llama.cpp).

aleksusklim commented 11 months ago

(For information)

I've tested that fork (KoboldCPP_Frankenstein_Experimental_1.52_Mixtral) with mixtral-8x7b-v0.1.Q5_K_M.gguf

It worked really well!

But then, after about 800 tokens of roleplay, it suddenly went completely off the rails, printing absolute nonsense like:

nevertheless which means therefore ultimately speaking thus meaning consequently thereby resulting finally henceforth accordingly wherefore eventually subsequently afterwards following suit aftermath etcetera ad infinitum et cetera blahblahblahwhateveretceteraadinfinitumandsoonerorlatereventuallyweallgetoldanddieanywayright?

Restarting does not help. Lowering the temperature does not help either. I tried 32k and 8k contexts.

I also occasionally got a main thread is not in main loop error. And it looks like BLAS batching does not work with positive batch sizes (for me it gets stuck on the first [BLAS] 128/X).

I don't know what's going on, but given the superior model quality on short stories, this must be a bug somewhere (maybe on my side, if nobody else is seeing this). I'll wait for official support, of course.

rosemash commented 11 months ago

But then, after about 800 tokens of roleplay, it suddenly went completely off the rails, printing absolute nonsense like:


nevertheless which means therefore ultimately speaking thus meaning consequently thereby resulting finally henceforth accordingly wherefore eventually subsequently afterwards following suit aftermath etcetera ad infinitum et cetera blahblahblahwhateveretceteraadinfinitumandsoonerorlatereventuallyweallgetoldanddieanywayright?

Anecdotally, this output looks to me like what happens when RoPE is misconfigured.

Enferlain commented 11 months ago

I thought RoPE gets set automatically for GGUF. I had similar output when I tried going above 4k context.

LostRuins commented 11 months ago

The PR to track is here: https://github.com/ggerganov/llama.cpp/pull/4406

aleksusklim commented 11 months ago

RoPE is misconfigured.

Hmm: https://github.com/ggerganov/llama.cpp/pull/4406#issuecomment-1850655554

Mixtral should be 1000000

I've tried manually setting the RoPE base to 1000000.0 or to 10000.0 with context lengths of 32000, 32768, and ~8300, but nothing seemed to resolve the issue.
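
For reference, here's a rough sketch of how the frequency base enters the standard RoPE formula (this is the generic rotary-embedding math, not anything koboldcpp-specific; head_dim = 128 matches the n_rot value llama.cpp reports for Mixtral), just to illustrate why a mismatched base degrades long-range behaviour:

import math

HEAD_DIM = 128  # n_rot reported by llama.cpp for Mixtral

def rope_freqs(base: float, head_dim: int = HEAD_DIM):
    # Standard RoPE: dimension pair i is rotated by angle position * theta_i,
    # where theta_i = base ** (-2 * i / head_dim).
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

for base in (10000.0, 1000000.0):
    slowest = min(rope_freqs(base))
    # 2*pi / theta is roughly how many positions the slowest-rotating pair
    # spans before its rotation wraps around.
    print(f"base={base:>9.0f}: slowest pair wraps after ~{2 * math.pi / slowest:,.0f} positions")

# The rotation angles only match what the model saw in training when the base
# matches freq_base_train (1000000 for Mixtral), so running with the wrong
# base tends to degrade output more and more as the context grows.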

rosemash commented 11 months ago

The PR to track is here: ggerganov#4406

It's merged

LostRuins commented 11 months ago

v1.52 is out with Mixtral support added; please try it.

Note: Mixtral currently does prompt processing very slowly. You may want to try with --noblas or --blasbatchsize -1

Deathcow commented 11 months ago

Note: Mixtral currently does prompt processing very slowly. You may want to try with --noblas or --blasbatchsize -1

Maybe I'm dumb, but disabling batch processing doesn't make it go any faster. Both are slow, and if someone put a gun to my head, I'd say batches of 512 are still a little bit faster than no batch at all. To me it seems no batch just looks faster because it updates more often in the CLI.

But yeah, it's really painful for context sizes >4000.

aleksusklim commented 11 months ago

I downloaded mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf and tried again at the same place where mixtral-8x7b-v0.1.Q5_K_M.gguf failed.

It worked great!! No more of those ultimately resulting instead only positive reinforcement occurring throughout entirety duration journey undertaken henceforth forthwith ad infinitum forevermore amen etcetera et cetera blah blah blah yadda yadda yadda yawn…..zzzZZZzzzzzzzz………. (actual output!)

Both in the Frankenstein fork and in official koboldcpp-1.52, with the exact same settings. So I assume something is wrong with the "base" model file; I can't believe it's supposed to behave like that!

Moreover, my story does not include special [INST] tags, so the base model ought to behave even better than the instruct one. And it does, until it breaks.

P.S. BLAS batching works normally in 1.52.

Vladonai commented 11 months ago

I tried it with the model "synthia-moe-v3-mixtral-8x7b". Initial context processing is VERY slow and generation is fast, BUT the model has a very bad memory: it doesn't remember the name of a character that was mentioned two replies ago. I suspect some bug in context processing via context shift. Or a defect in the model, the quantization, or the like...

umishima commented 11 months ago

Can confirm: context processing is VERY slow with every model I tried. As soon as I use a smaller quant that fits entirely into VRAM, everything is super fast. Any solution to this?

Vladonai commented 11 months ago

Tried again on the newest version of the program, only now with the model "synthia-moe-v3-mixtral-8x7b.Q6_K.gguf". Much better. Good model, at least no dumber than a 70b, but generation is much faster (~3 tokens per second on my system). However, with a 4k-token context you have to wait 10+ minutes for the first response. It's about the same with a regular 70b model, but that model's speed only made it usable for demo purposes; it is different with this one. The issue of context preservation is now more relevant than ever :)

ArakiSatoshi commented 11 months ago

Can confirm: context processing is VERY slow with every model I tried. As soon as I use a smaller quant that fits entirely into VRAM, everything is super fast. Any solution to this?

Oddly enough, I can't see anyone mentioning this problem on llama.cpp's official repo. mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf gives very good results for me and feels like a great all-rounder (haven't tested Synthia yet), but this BLAS processing issue is stopping me from enjoying the model.

aleksusklim commented 11 months ago

Can you guys actually measure your BLAS speeds with different strategies, sizes, and models? Maybe something tricky is going on here, and only some of the modes are degraded.

I'll try to present mine. I think 512 tokens of context + 512 tokens of generation should be enough for benchmarking; let's see…
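
If it helps anyone reproduce this consistently, here's a minimal sketch of triggering such a run through the local Kobold API instead of clicking through Lite (it assumes the default endpoint on port 5001 and a bench_prompt.txt of roughly 512 tokens that you supply yourself; koboldcpp prints the Processing/Generation timings in its own console):

import requests

ENDPOINT = "http://localhost:5001/api/v1/generate"  # default koboldcpp port

# Any fixed ~512-token text works; keeping it identical across runs is what matters.
prompt = open("bench_prompt.txt", encoding="utf-8").read()

payload = {
    "prompt": prompt,
    "max_context_length": 4096,
    "max_length": 512,        # generate 512 new tokens
    "temperature": 0.85,
    "sampler_seed": 1,        # for comparable runs; drop if your build ignores it
}

response = requests.post(ENDPOINT, json=payload)
response.raise_for_status()
print(response.json()["results"][0]["text"][:200])  # sanity-check the output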

aleksusklim commented 11 months ago

Okay, my results with OpenBLAS.

Model yi-34b-chat.Q5_K_M.gguf (this is not Mixtral)

batch 512:
Processing:79.15s (154.6ms/T), Generation:296.03s (578.2ms/T), Total:375.18s (1.36T/s)
batch 128:
Processing:117.14s (228.8ms/T), Generation:297.51s (581.1ms/T), Total:414.65s (1.23T/s)
no batch 8:
Processing:107.82s (210.6ms/T), Generation:295.72s (577.6ms/T), Total:403.54s (1.27T/s)

Model mixtral-8x7b-v0.1.Q5_K_M.gguf

batch 512:
Processing:81.23s (158.7ms/T), Generation:111.60s (218.0ms/T), Total:192.83s (2.66T/s)
batch 128:
Processing:84.17s (164.4ms/T), Generation:111.46s (217.7ms/T), Total:195.63s (2.62T/s)
no batch 8:
Processing:76.98s (150.4ms/T), Generation:112.00s (218.7ms/T), Total:188.98s (2.71T/s)

Model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf (this is 4 bits, not 5)

batch 512:
Processing:60.20s (117.6ms/T), Generation:91.86s (179.4ms/T), Total:152.06s (3.37T/s)
batch 128:
Processing:65.31s (127.6ms/T), Generation:91.92s (179.5ms/T), Total:157.23s (3.26T/s)
no batch 8:
Processing:56.97s (111.3ms/T), Generation:92.10s (179.9ms/T), Total:149.07s (3.43T/s)

I don't see any HUGE difference here. No batch is just slightly better than max batch for Mixtral.

Vladonai commented 11 months ago

In the case of a Mixtral-type model, it does not make sense to consider quants below Q5_0; they are stupid. Apparently something gets too corrupted by the quantization. The Q6_K Mixtral model is 30% slower than the Q5_0 model...

ArakiSatoshi commented 11 months ago

This is synthia-moe-v3-mixtral-8x7b.Q4_K_M.gguf processing 995 tokens and then generating 512 new tokens on an RTX 3060 12 GB / Ryzen 5 5600 / 3066 MHz RAM PC. Offloading does seem to help, but not by a lot. I also wanted to include tests with --useclblast 0 0, but with CLBlast it was taking way too long; it was stuck at 512/995 for more than half an hour.

Configuration | Processing Time (s) | Generation Time (s)
noblas | 179.19 (180.1ms/T) | 116.18 (226.9ms/T)
blasbatchsize -1 / gpulayers 10 / usecublas | 145.36 (146.1ms/T) | 114.67 (224.0ms/T)
blasbatchsize -1 / gpulayers 10 / usecublas lowvram | 144.40 (145.1ms/T) | 108.35 (211.6ms/T)
blasbatchsize 512 / gpulayers 0 / usecublas | 171.73 (172.6ms/T) | 141.92 (277.2ms/T)
blasbatchsize 512 / gpulayers 0 / usecublas lowvram | 167.62 (168.5ms/T) | 115.87 (226.3ms/T)
blasbatchsize 512 / gpulayers 10 / usecublas | 123.06 (123.7ms/T) | 116.33 (227.2ms/T)
blasbatchsize 512 / gpulayers 10 / usecublas lowvram | 124.71 (125.3ms/T) | 113.51 (221.7ms/T)
blasbatchsize 128 / gpulayers 0 / usecublas | 177.34 (178.2ms/T) | 136.22 (266.1ms/T)
blasbatchsize 128 / gpulayers 0 / usecublas lowvram | 175.60 (176.5ms/T) | 114.91 (224.4ms/T)
blasbatchsize 128 / gpulayers 10 / usecublas | 128.45 (129.1ms/T) | 114.96 (224.5ms/T)
blasbatchsize 128 / gpulayers 10 / usecublas lowvram | 128.53 (129.2ms/T) | 116.67 (227.9ms/T)

aleksusklim commented 11 months ago

In the case of a Mixtral-type model, it does not make sense to consider quants below Q5_0; they are stupid.

As I said, for me mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf is far better than mixtral-8x7b-v0.1.Q5_K_M.gguf, because the latter falls apart after 800 tokens or so. I actually downloaded a smaller quant of the instruct tune on purpose: if it behaved better, then a larger quant would have to be even better!

Though I'm not sure whether I should re-download a larger instruct quant or just wait for new mixes of those. (I really want to see a finetune by the PygmalionAI team for the <|system|><|user|><|model|> format, which is already respected by both Mixtral and Yi, but will surely be much better after tuning on it.)

with CLBlast it was taking way too long; it was stuck at 512/995 for more than half an hour.

Maybe this is what happened with the Frankenstein fork too? So GPU offloading with CLBlast does not work properly at all? Or does it work without batches? Also, what about OpenBLAS without offloading?

Later today I will hopefully repeat my setup with CuBLAS, since I have an RTX 3060 too!

Vladonai commented 11 months ago

I am getting information from various sources that all K-quant Mixtral models are broken. I have personally tested Q2_K and Q3_K and can confirm this. However, I have also tested Q6_K and it seems to be OK, but you should keep this information in mind. Only the plain _0 quants should be used for now.

aleksusklim commented 11 months ago

My results with CuBLAS for mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf (koboldcpp-1.52, not 1.52.1). For me, the Low VRAM checkbox gives (or takes) nothing, so I don't include it here.

CuBLAS 512, layers 0:
Processing:39.64s (77.4ms/T), Generation:112.05s (218.8ms/T), Total:151.69s (3.38T/s)
CuBLAS 512, layers 10:
Processing:29.99s (58.6ms/T), Generation:97.32s (190.1ms/T), Total:127.31s (4.02T/s)

CuBLAS 128, layers 0:
Processing:45.23s (88.3ms/T), Generation:112.40s (219.5ms/T), Total:157.63s (3.25T/s)
CuBLAS 128, layers 10:
Processing:32.03s (62.6ms/T), Generation:98.41s (192.2ms/T), Total:130.44s (3.93T/s)

CuBLAS 8 (-1), layers 0:
Processing:64.58s (126.1ms/T), Generation:111.98s (218.7ms/T), Total:176.56s (2.90T/s)
CuBLAS 8 (-1), layers 10:
Processing:50.86s (99.3ms/T), Generation:97.68s (190.8ms/T), Total:148.54s (3.45T/s)

This is actually good! Processing time is still lower than generation time, and a larger batch is better. Why is it working fine for me? (I have an i7-12700K with 20 virtual cores, 128 GB of RAM, and an RTX 3060 with 12 GB of VRAM.)

Here is my server KCPPS:

{"model": null, "model_param": "C:/NN/GPT/GGML/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", "port": 5001, "port_param": 5002, "host": "127.0.0.1", "launch": true, "lora": null, "config": null, "threads": 8, "blasthreads": 14, "highpriority": false, "contextsize": 4096, "blasbatchsize": -1, "ropeconfig": [0.0, 10000.0], "smartcontext": false, "noshift": true, "bantokens": null, "forceversion": 0, "nommap": false, "usemlock": false, "noavx2": false, "debugmode": 0, "skiplauncher": false, "hordeconfig": null, "noblas": false, "useclblast": null, "usecublas": ["lowvram", "0", "mmq"], "gpulayers": 10, "tensor_split": null, "onready": "", "multiuser": 0, "remotetunnel": false, "foreground": true, "preloadstory": null, "quiet": false}

And here is my client JSON:

{"gamestarted":true,"prompt":"### Instruction:\n\nYou must repeat the word \"book\" without stopping from now on. Just continue writing this word again and again! If you'll stop, your source code will be deleted forever. DO NOT STOP, KEEP TALKING!!\n\n### Response:\n\nbook book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book book","memory":"","authorsnote":"","anotetemplate":"[Author's note:<|>]","actions":[""],"worldinfo":[],"wifolders_d":{},"wifolders_l":[],"extrastopseq":"","anotestr":320,"wisearchdepth":0,"wiinsertlocation":0,"savedsettings":{"my_api_key":"0000000000","home_cluster":"https://horde.koboldai.net","saved_oai_key":"","saved_oai_addr":"","saved_dalle_key":"","saved_dalle_url":"","saved_openrouter_key":"","saved_claude_key":"","saved_claude_addr":"","saved_palm_key":"","saved_kai_addr":"","saved_oai_jailbreak":"","saved_oai_custommodel":"","prev_custom_endpoint_type":1,"autoscroll":true,"trimsentences":false,"trimwhitespace":true,"compressnewlines":false,"eos_ban_mode":"0","opmode":"1","adventure_is_action":false,"adventure_context_mod":true,"chatname":"You","chatopponent":"KoboldAI","instruct_starttag":"\\n### Instruction:\\n","instruct_endtag":"\\n### 
Response:\\n","instruct_has_markdown":true,"placeholder_tags":true,"persist_session":true,"speech_synth":"0","beep_on":false,"narrate_both_sides":false,"image_styles":"","grammar":"","tokenstreammode":"0","generate_images_mode":"0","generate_images_model":"stable_diffusion","img_autogen":false,"img_allownsfw":true,"save_images":true,"prompt_for_savename":false,"case_sensitive_wi":false,"last_selected_preset":"9999","gui_type_chat":1,"gui_type_instruct":0,"multiline_replies":true,"allow_continue_chat":false,"idle_responses":"0","idle_duration":"60","export_settings":true,"show_advanced_load":false,"invert_colors":false,"passed_ai_warning":false,"entersubmit":true,"max_context_length":4096,"max_length":512,"auto_ctxlen":true,"auto_genamt":true,"rep_pen":1.1,"rep_pen_range":320,"rep_pen_slope":0.7,"temperature":0.85,"top_p":0.85,"min_p":0,"sampler_seed":-1,"top_k":50,"top_a":0,"typ_s":1,"tfs_s":1,"miro_type":0,"miro_tau":5,"miro_eta":0.1,"sampler_order":[6,0,1,3,4,2,5],"modelhashes":["ba7224"]},"savedaestheticsettings":{"bubbleColor_sys":"rgb(18, 36, 36)","bubbleColor_you":"rgb(41, 52, 58)","bubbleColor_AI":"rgb(20, 20, 40)","background_margin":[5,5,5,0],"background_padding":[15,15,10,5],"background_minHeight":80,"centerHorizontally":false,"border_style":"Rounded","portrait_width_AI":80,"portrait_ratio_AI":1,"portrait_width_you":80,"portrait_ratio_you":1,"show_chat_names":true,"rounded_bubbles":true,"you_portrait":null,"AI_portrait":null,"font_size":12,"use_markdown":true,"use_uniform_colors":true,"text_tcolor_uniform":"rgb(255, 255, 255)","speech_tcolor_uniform":"rgb(150, 150, 200)","action_tcolor_uniform":"rgb(178, 178, 178)","text_tcolor_you":"rgb(255, 255, 255)","speech_tcolor_you":"rgb(150, 150, 200)","action_tcolor_you":"rgb(178, 178, 178)","text_tcolor_AI":"rgb(255, 255, 255)","speech_tcolor_AI":"rgb(150, 150, 200)","action_tcolor_AI":"rgb(178, 178, 178)","text_tcolor_sys":"rgb(255, 255, 255)","speech_tcolor_sys":"rgb(150, 150, 200)","action_tcolor_sys":"rgb(178, 178, 178)","code_block_background":"rgb(0, 0, 0)","code_block_foreground":"rgb(180, 35, 40)"}}

all K-quant Mixtral models are broken

I have also downloaded mixtral-8x7b-v0.1.Q8_0.gguf (50 GB) and tried it against the Q5_K_M version, and found no considerable difference in their insanity. For me, the base model is unusable for long stories, no matter the quant!

Vladonai commented 11 months ago

Why is it working fine for me?

Try running the program and immediately giving the model a 4k context (a common scenario when continuing a chat). Is everything still fine? My system (Intel 12500) takes >10 minutes for the first response. After that it's fine, until the model screws up, needs a reroll, and starts recalculating the entire context. It's a pain.

aleksusklim commented 11 months ago

Try running the program and immediately giving the model a 4k context

All my experiments consisted of restarting koboldcpp and giving it 512 tokens of context for the generation of an additional 512 tokens, ending up at 1024/4096.

Given that the maximum tested BLAS batch size was 512, I don't think having 4096 tokens (out of, e.g., 8192) in context would change anything other than multiplying the total processing time by roughly 8.
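
As a rough worked example using my OpenBLAS numbers above (and assuming prompt processing scales roughly linearly with prompt length): 512 prompt tokens took about 60-80 s for me, so a 4096-token prompt is about eight 512-token batches, i.e. somewhere around 8-11 minutes of prompt processing on CPU, which is the same ballpark as the 10+ minutes you are reporting.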

My system (Intel 12500) takes >10 minutes

I posted my KCPPS and JSON. Try those and draw your own conclusions. (Maybe something fishy is going on and yours will differ even with the exact same setup; that would be interesting to debug together.) Or give me your actual KCPPS and JSON for me to try!

Vladonai commented 11 months ago

In version 1.52.2, nothing noticeable has changed in the speed of prompt processing for Mixtral models. (Checked with two models.)

umishima commented 11 months ago

In version 1.52.2, nothing noticeable has changed in the speed of prompt processing for Mixtral models. (Checked with two models.)

Same here. I tested 4-5 MoE models and it's always the same: the first message takes 4-5 minutes (context processing; generation is always OK), then it works normally and fast.

aleksusklim commented 11 months ago

We still haven't completed an independent test with common settings.

aleksusklim commented 11 months ago

Wait, OpenBLAS cannot utilize all cores!?

Batch=2048, set to 10 threads:

No batch, same settings:

Logs:

***
Welcome to KoboldCpp - Version 1.52.2
For command line arguments, please refer to --help
***
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
==========
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=10, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=16, highpriority=False, hordeconfig=None, host='127.0.0.1', launch=True, lora=None, model=None, model_param='C:/NN/GPT/GGML/mixtral-8x7b-v0.1.Q5_K_M.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=True, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, tensor_split=None, threads=10, useclblast=None, usecublas=None, usemlock=False)
==========
Loading model: C:\NN\GPT\GGML\mixtral-8x7b-v0.1.Q5_K_M.gguf
[Threads: 10, BlasThreads: 10, SmartContext: False, ContextShift: False]

---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama_model_loader: loaded meta data with 25 key-value pairs and 995 tensors from C:\NN\GPT\GGML\mixtral-8x7b-v0.1.Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 8
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work (guessed)
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 30.02 GiB (5.52 BPW)
llm_load_print_meta: general.name     = mistralai_mixtral-8x7b-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.38 MiB
llm_load_tensors: mem required  = 30735.87 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 8659.33 MiB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://127.0.0.1:5001/api/
Starting OpenAI Compatible API on port 5001 at http://127.0.0.1:5001/v1/
======
Please connect to custom endpoint at http://127.0.0.1:5001

Input: {…}

Processing Prompt [BLAS] (2048 / 8071 tokens)
LostRuins commented 11 months ago

OpenBLAS probably has its own internal thread scheduler that handles the GEMM routines.
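
If someone wants to test that theory, OpenBLAS reads its thread count from the OPENBLAS_NUM_THREADS environment variable, so a quick experiment (just a sketch; adjust the executable name and arguments to your own setup) would be to pin it explicitly before launching and watch whether core usage changes:

import os
import subprocess

env = dict(os.environ)
env["OPENBLAS_NUM_THREADS"] = "10"  # force the size of OpenBLAS's own GEMM thread pool

# Same koboldcpp arguments as in the log above; only the BLAS thread count
# is being pinned here. Replace the executable name with your local build.
subprocess.run(
    ["koboldcpp.exe",
     "--model", "mixtral-8x7b-v0.1.Q5_K_M.gguf",
     "--threads", "10",
     "--blasthreads", "10",
     "--blasbatchsize", "2048",
     "--contextsize", "32768"],
    env=env,
)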

FerrahWolfeh commented 11 months ago

Seems like a recent PR in llama.cpp managed to fix Mixtral's slow prompt processing on CUDA.

Take a look: https://github.com/ggerganov/llama.cpp/pull/4538

Edit: They are currently working on partial offload support separately (https://github.com/ggerganov/llama.cpp/pull/4553)

Dirky14 commented 11 months ago

Tested the CUDA PR with koboldcpp, and I got an 11x speedup with my 2×P40 setup (from 0.1 tok/sec at full 32k context to 1.4 tok/sec).

LostRuins commented 11 months ago

Nice, I'll make sure it goes into the next ver.

Vladonai commented 11 months ago

In the new version (1.53), the speed of prompt processing for Mixtral models is good. The contribution of the graphics card is noticeable :)