LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Command-R #761

Open · SrVill opened this issue 3 months ago

SrVill commented 3 months ago

Will there ever be a version with Command-R (https://huggingface.co/CohereForAI/c4ai-command-r-v01) support? llama.cpp has supported this model for a long time.

GrennKren commented 3 months ago

Build it from the concedo_experimental branch. I tried it before and it works.

tomgm777 commented 3 months ago

> Build it from the concedo_experimental branch. I tried it before and it works.

I built it on Windows following the "Compiling on Windows" instructions, but Command-R does not start. I have overwritten llama.cpp and llama.h with the latest versions, but do I need to overwrite anything else?

GrennKren commented 3 months ago

> > Build it from the concedo_experimental branch. I tried it before and it works.
>
> I built it on Windows following the "Compiling on Windows" instructions, but Command-R does not start. I have overwritten llama.cpp and llama.h with the latest versions, but do I need to overwrite anything else?

I've never tried it, but does compiling on Windows require overwriting any files? I'm on Linux; I usually just run these commands:

$ git clone https://github.com/LostRuins/koboldcpp
$ cd koboldcpp
$ git switch concedo_experimental
$ make LLAMA_CUBLAS=1

tomgm777 commented 3 months ago

> > > Build it from the concedo_experimental branch. I tried it before and it works.
> >
> > I built it on Windows following the "Compiling on Windows" instructions, but Command-R does not start. I have overwritten llama.cpp and llama.h with the latest versions, but do I need to overwrite anything else?
>
> I've never tried it, but does compiling on Windows require overwriting any files? I'm on Linux; I usually just run these commands:
>
> $ git clone https://github.com/LostRuins/koboldcpp
> $ cd koboldcpp
> $ git switch concedo_experimental
> $ make LLAMA_CUBLAS=1

Thank you, now I understand how to use branches.

EwoutH commented 3 months ago

They have now also released a larger, 104B-parameter model: C4AI Command R+.

CamiloMM commented 3 months ago

Support for Command-R comes from commit 12247f4 (PR #6033), so I assume Kobold just needs to include the latest llama.cpp?

Also really cool to see them releasing a 104B model, though I assume it takes even more than 2x24GB to run a quant of that, right? Because of the KV cache. I actually don't know how much VRAM you need for iMat quants of the 35B.
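
As a rough back-of-the-envelope sketch of how the KV cache scales (illustrative Python only; the hyperparameter values below are placeholders, not the real Command-R+ numbers, so substitute the layer count, KV-head count, and head dimension from the actual GGUF metadata):

def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_value=2):
    # Rough FP16 KV-cache estimate: keys plus values for every layer.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value

# Placeholder hyperparameters only -- read the real values from the model metadata.
print(kv_cache_bytes(n_layers=64, n_ctx=8192, n_kv_heads=8, head_dim=128) / 2**30, "GiB")  # 2.0 GiB for these assumed values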

Vladonai commented 3 months ago

The big question is whether imatrix quants even work as expected...

henk717 commented 3 months ago

The maintainer of the repo is away from home until April 7th. Of course this model will be added upon his return; he will just need a bit of time to catch up with upstream.

CamiloMM commented 3 months ago

Ah, that's perfectly understandable. Thanks for the heads-up, Henky.

LostRuins commented 3 months ago

Hello, can you please try the latest release and see if it works for you now?

SrVill commented 3 months ago

> Hello, can you please try the latest release and see if it works for you now?

It works great!

SabinStargem commented 3 months ago

I gave v1.61.2 a try, but Command-R+ doesn't boot. The version of Command-R+ I used is the IQ4_XS imatrix quant from Dranger, both as a joined file and as separate splits. It might be an issue with the archives, since other models that I joined with HJSplit or PeaZip worked fine. I will report the possibility to Dranger, just in case.

https://huggingface.co/dranger003/c4ai-command-r-plus-iMat.GGUF

Here is the error message.


Welcome to KoboldCpp - Version 1.62.1
For command line arguments, please refer to --help

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(bantokens=None, benchmark=None, blasbatchsize=512, blasthreads=31, chatcompletionsadapter=None, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=31, highpriority=False, hordeconfig=None, host='', ignoremissing=False, launch=True, lora=None, mmproj=None, model=None, model_param='C:/KoboldCPP/Models/ggml-c4ai-command-r-plus-104b-iq4_xs.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=31, useclblast=None, usecublas=['normal', '0', 'mmq'], usemlock=True, usevulkan=None)

Loading model: C:\KoboldCPP\Models\ggml-c4ai-command-r-plus-104b-iq4_xs.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: command-r


Identified as GGUF model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
llama_model_load: error loading model: invalid split file: C:\KoboldCPP\Models\ggml-c4ai-command-r-plus-104b-iq4_xs.gguf
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "koboldcpp.py", line 3192, in <module>
  File "koboldcpp.py", line 2942, in main
  File "koboldcpp.py", line 398, in load_model
OSError: exception: access violation reading 0x0000000000000070
[24972] Failed to execute script 'koboldcpp' due to unhandled exception!

[process exited with code 1 (0x00000001)]

LostRuins commented 3 months ago

How did you merge the GGUF files? Did you use the official split tool?

LostRuins commented 3 months ago

CR+ is too big for me to test personally. But I should be using the same code as upstream to run it.

SabinStargem commented 3 months ago

I used both HJSplit and PeaZip for joining the files. Then I tried loading the model with the split archives that I started with. With other models, the files I joined with HJSplit and PeaZip worked fine.

There is a 23 GB IQ1_S. While cruddy, that might fit into your hardware envelope for testing?

SabinStargem commented 3 months ago

Ah, the split files do work. They were originally formatted with the -of- naming, instead of the .001 and .002 naming that I used.

The AI has successfully typed out some legible text, so it looks like the implementation is a success.

Vladonai commented 3 months ago

In principle, why not have the program load partitioned models directly? You would pass --model splitmodel.001 on the command line, and if the extension is .001, then after loading that part the program tries to load a part with the same name but with the extension .002, and so on...
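
A minimal sketch of that idea in Python (illustrative only; collect_numbered_parts is a hypothetical helper, not a function that exists in koboldcpp.py):

import os

def collect_numbered_parts(first_part):
    # Given e.g. "splitmodel.001", return ["splitmodel.001", "splitmodel.002", ...]
    # for every consecutively numbered part that exists on disk.
    base, ext = os.path.splitext(first_part)
    if not (ext.startswith(".") and ext[1:].isdigit()):
        return [first_part]  # not a numbered split; load the file as-is
    parts, index, width = [], int(ext[1:]), len(ext) - 1
    while os.path.exists(f"{base}.{index:0{width}d}"):
        parts.append(f"{base}.{index:0{width}d}")
        index += 1
    return parts

# collect_numbered_parts("splitmodel.001") -> ["splitmodel.001", "splitmodel.002", ...]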

SabinStargem commented 3 months ago

I don't like having many files if a single condensed file can do the job. Unfortunately, whatever method created the splits doesn't seem friendly to casual software users like myself. You have to use a terminal for merging, apparently.

henk717 commented 3 months ago

Side note for RunPod rental users: https://koboldai.org/runpodcpp does accept split files, separated by a comma. The download script for GGUF combines them prior to launching KoboldCpp.

But I also haven't managed to run it using our own Docker image, which normally handles split files perfectly.

LostRuins commented 3 months ago

@SabinStargem just for clarity for everyone, can you post the correct file names of the split files that ended up working for you? I know it's quite sensitive to the actual filename format.

henk717 commented 3 months ago

Managed to implement gguf-split handling for the Docker / RunPod / VastAI containers. Usage for the end user has not changed, but when a new gguf-split file is detected, the container refrains from merging the files together with cat and instead lets KoboldCpp handle loading the split files.

nanolion commented 3 months ago

It worked without issues on my end; loading the first of the split files loads the entire model. Here are example file names:

ggml-c4ai-command-r-plus-104b-iq4_xs-00001-of-00002.gguf
ggml-c4ai-command-r-plus-104b-iq4_xs-00002-of-00002.gguf
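
As a quick illustration of that naming scheme (a Python sketch, not koboldcpp code), the -NNNNN-of-NNNNN suffix tells you every shard that has to sit next to the first file:

import re

SPLIT_RE = re.compile(r"^(.*)-(\d{5})-of-(\d{5})\.gguf$")

def expected_shards(first_file):
    # Return all shard names implied by a gguf-split style filename,
    # or just the file itself if it does not match the pattern.
    m = SPLIT_RE.match(first_file)
    if not m:
        return [first_file]
    prefix, _, total = m.groups()
    return [f"{prefix}-{i:05d}-of-{total}.gguf" for i in range(1, int(total) + 1)]

# expected_shards("ggml-c4ai-command-r-plus-104b-iq4_xs-00001-of-00002.gguf")
# -> both the -00001-of-00002.gguf and -00002-of-00002.gguf names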

henk717 commented 3 months ago

I can confirm they work fine; the change of format for split files confused everyone familiar with the old method. I had to rewrite my Docker setup to support the new method, but now it works on RunPod. And locally, if you have these 00001-of files, you indeed have to either just load them directly or merge them with gguf-split instead of the old methods.
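
For anyone who still prefers a single file, the merge mentioned above would look roughly like this when wrapped in Python; this assumes llama.cpp's gguf-split tool and its --merge mode, so check gguf-split --help on your build before relying on the exact flags:

import subprocess

# Assumed invocation of llama.cpp's gguf-split tool; the flags have changed
# over time, so verify them with `gguf-split --help` first.
subprocess.run([
    "./gguf-split", "--merge",
    "ggml-c4ai-command-r-plus-104b-iq4_xs-00001-of-00002.gguf",
    "ggml-c4ai-command-r-plus-104b-iq4_xs-merged.gguf",
], check=True)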

aleksusklim commented 3 months ago

> https://huggingface.co/dranger003/c4ai-command-r-plus-iMat.GGUF

Sorry if this is the wrong place, but does anybody know which files are preferable: those, or https://huggingface.co/pmysl/c4ai-command-r-plus-GGUF ? I've downloaded the Q5_K_M pair from the latter and it seems to work fine in 1.62.2 (without joining the splits).

aleksusklim commented 2 months ago

Actually, no. I get strange output (off the rails, not finishing sentences but adding less and less sensical words) after just a few turns, even with temperature 0 (or top_k 1). Tried both.

(With their second parts stored in the same folder.)

Do I have to specify --ropeconfig for Command-R+ explicitly?

LostRuins commented 2 months ago

If you are exceeding their normal context size, yes.

aleksusklim commented 2 months ago

So, what are their parameters? And what is their nominal context size? I'm kind of confused, since for GGUF models this has always worked automatically, relying on the model's internal metadata.

LostRuins commented 2 months ago

Seems to be about 8k. I don't see it defined in the gguf metadata though.
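
One way to check what the file itself declares is to dump its metadata keys with the gguf Python package that ships with llama.cpp (a sketch under the assumption that GGUFReader exposes the fields this way in your installed version; the key may simply be missing, as noted above):

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("c4ai-command-r-v01.gguf")
for name in reader.fields:
    # look for keys such as a context_length or rope setting for the arch
    if "context_length" in name or "rope" in name:
        print(name)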

henk717 commented 2 months ago

Keep in mind this model is allergic to repetition penalty. Turn it way down, to something like 1.01. That isn't a bug; it just happens on some tunes, especially larger models. I've seen it before on NeoX and some finetunes too.
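
If you drive KoboldCpp through its HTTP API rather than the UI, turning the penalty down looks roughly like this; the endpoint and field names assume the standard KoboldAI-compatible /api/v1/generate payload, so double-check them against your KoboldCpp version:

import requests

payload = {
    "prompt": "Continue the story:\n",
    "max_context_length": 8192,
    "max_length": 200,
    "rep_pen": 1.01,     # near-disabled repetition penalty, per the advice above
    "temperature": 0.7,
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])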

aleksusklim commented 2 months ago

Oh! I had my usual 1.1; I will definitely try disabling it, thanks.

aleksusklim commented 2 months ago

Hmm, lowering max context to 8k seems to have fixed the issue, but disabling rep_pen looks even better! (Otherwise it seemed that the model was shy to print , or . to finish its sentences.)

But why just 8k? I thought Command-R could handle very long contexts by default, like Mixtral?

LostRuins commented 2 months ago

I don't know, I'm just guessing based on the embed positions in the json.

anunknowperson commented 2 months ago

The Command-R and Command-R+ models have a 128k context size, as specified here: https://huggingface.co/CohereForAI/c4ai-command-r-v01

anunknowperson commented 2 months ago

I get this error on the latest koboldcpp when loading Command-R. The same GGUF works perfectly in text-generation-webui.


Welcome to KoboldCpp - Version 1.64
For command line arguments, please refer to --help

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(benchmark=None, blasbatchsize=512, blasthreads=17, chatcompletionsadapter=None, config=None, contextsize=2048, debugmode=0, flashattention=False, forceversion=0, foreground=False, gpulayers=100, highpriority=False, hordeconfig=None, host='', ignoremissing=False, launch=True, lora=None, mmproj=None, model=None, model_param='D:/text-generation-webui/models/c4ai-command-r-v01.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=17, useclblast=None, usecublas=['normal', '0', 'mmq'], usemlock=False, usevulkan=None)

Loading model: D:\text-generation-webui\models\c4ai-command-r-v01.gguf [Threads: 17, BlasThreads: 17, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: command-r


Identified as GGUF model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 32 key-value pairs and 322 tensors from D:\text-generation-webui\models\c4ai-command-r-v01.gguf
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'command-r'
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "koboldcpp.py", line 3332, in <module>
  File "koboldcpp.py", line 3074, in main
  File "koboldcpp.py", line 396, in load_model
OSError: exception: access violation reading 0x0000000000000070
[14020] Failed to execute script 'koboldcpp' due to unhandled exception!

henk717 commented 2 months ago

They very recently switched the tokenization for this model, so your quant format is newer than the newest KoboldCpp release. Hold on to that file since it will probably work better in our next release; feel free to link it so we can double-check that it works.

The one I just tested does load, but doesn't have the new tokenizer: https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF/resolve/main/c4ai-command-r-v01-Q4_K_S.gguf?download=true

Side note: you are using 1.64, which has known CLIP bugs. If you care about LLaVA functionality in your models, upgrade to 1.64.1.