SrVill opened this issue 3 months ago
Build it from concedo_experimental branch. I tried it before and it works
I built it on Windows following the Compiling on Windows instructions, but command-r does not start. I have overwritten llama.cpp and llama.h with the latest versions; do I need to overwrite anything else?
I've never tried it, but does compiling on Windows require overwriting some files? I'm on Linux, I usually just run these commands:
$ git clone https://github.com/LostRuins/koboldcpp
$ cd koboldcpp
$ git switch concedo_experimental
$ make LLAMA_CUBLAS=1
Thank you, I understand how to use branches.
They now also released a larger, 104B parameter model: C4AI Command R+
Support for Command-R comes from commit 12247f4 - PR#6033 so I assume Kobold just needs to include the latest llama.cpp?
Also really cool to see them releasing a 104B model, though I assume it takes even more than 2x24GB to run a quant of that, right? Because of the KV cache. I actually don't know how much VRAM you need for iMat quants of the 35b.
The big question is whether imatrix even works as expected for them...
The maintainer of the repo has been away from home until April 7th. Of course this model will be added upon his return; he will need a bit of time to catch up with upstream.
Ah that's perfectly understandable. Thanks for the heads up Henky.
Hello, can you please try the latest release and see if it works for you now?
Hello, can you please try the latest release and see if it works for you now?
It works great!
I gave v1.61.2 a try, but Command-R+ doesn't boot. The version of Command-R+ I used is the IQ4xs Imat from Dranger, both as a joined file and as separate splits. It might be an issue with the archives, as other models that I joined with HJSplit or PeaZip worked fine. I will report the possibility to Dranger, just in case.
https://huggingface.co/dranger003/c4ai-command-r-plus-iMat.GGUF
Here is the error message.
Welcome to KoboldCpp - Version 1.62.1 For command line arguments, please refer to --help
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required. Initializing dynamic library: koboldcpp_cublas.dll
Namespace(bantokens=None, benchmark=None, blasbatchsize=512, blasthreads=31, chatcompletionsadapter=None, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=31, highpriority=False, hordeconfig=None, host='', ignoremissing=False, launch=True, lora=None, mmproj=None, model=None, model_param='C:/KoboldCPP/Models/ggml-c4ai-command-r-plus-104b-iq4_xs.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=31, useclblast=None, usecublas=['normal', '0', 'mmq'], usemlock=True, usevulkan=None)
Loading model: C:\KoboldCPP\Models\ggml-c4ai-command-r-plus-104b-iq4_xs.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: command-r
Identified as GGUF model: (ver 6) Attempting to Load...
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
llama_model_load: error loading model: invalid split file: C:\KoboldCPP\Models\ggml-c4ai-command-r-plus-104b-iq4_xs.gguf
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "koboldcpp.py", line 3192, in
  File "koboldcpp.py", line 2942, in main
  File "koboldcpp.py", line 398, in load_model
OSError: exception: access violation reading 0x0000000000000070
[24972] Failed to execute script 'koboldcpp' due to unhandled exception!
[process exited with code 1 (0x00000001)]
How did you merge the GGUF files? Did you use the official split tool?
CR+ is too big for me to test personally. But I should be using the same code as upstream to run it.
I used both HJ Split and PeaZip for joining the files. Then I tried loading the model with the split archives that I started with. Other models that I joined with HJ Split and PeaZip worked fine.
There is a 23GB IQ1s. While cruddy, that might fit into your hardware envelope for testing?
Ah, the split files do work. They were originally formatted as -of-, instead of the .001 and .002 that I used.
The AI has successfully typed out some legible text, so it looks like the implementation is a success.
In principle, why not have the program load partitioned models directly? You'd write on the command line --model splitmodel.001, and if the extension is .001, then after loading this part the program would try to load a part with the same name but extension .002, and so on...
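The proposed behaviour could be sketched like this (a hypothetical helper of my own, not KoboldCpp's actual loader; splitmodel.001 is just an illustrative name):

```python
# Hypothetical sketch of the loader idea above: given a first part like
# "splitmodel.001", keep probing for .002, .003, ... until a part is
# missing. This is NOT KoboldCpp's actual code.
import os

def enumerate_numbered_parts(first_part):
    base, ext = os.path.splitext(first_part)
    if not ext[1:].isdigit():
        return [first_part]  # not a numbered split; load as a single file
    width = len(ext) - 1     # preserve zero padding, e.g. ".001" -> width 3
    parts = []
    n = int(ext[1:])
    while os.path.exists(f"{base}.{n:0{width}d}"):
        parts.append(f"{base}.{n:0{width}d}")
        n += 1
    return parts
```

A real loader would then open the returned paths in order and stream them as one model.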
I don't like having many files if a single condensed file can do the job. Unfortunately, whatever method created the splits doesn't seem friendly to casual software users like myself. You have to use a terminal for merging, apparently.
Side note for runpod rental users: https://koboldai.org/runpodcpp does accept split files, separated by a comma. The download script for GGUF combines them prior to launching kcpp.
But I also haven't managed to run it using our own Docker, which normally handles split files perfectly.
@SabinStargem just for clarity for everyone, can you post the correct file names of the split files that ended up working for you? I know it's quite sensitive to the actual filename format.
Managed to implement gguf-split handling for the Docker / Runpod / VastAI containers. Usage for the end user has not changed, but when a new gguf-split file is detected it will refrain from merging the files together with cat and instead let KoboldCpp handle loading the split files.
It worked without issues on my end; loading the first of the split files loads the entire model. Here are example file names:
ggml-c4ai-command-r-plus-104b-iq4_xs-00001-of-00002.gguf
ggml-c4ai-command-r-plus-104b-iq4_xs-00002-of-00002.gguf
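For reference, that naming scheme can be parsed mechanically. Here is a small sketch (my own illustrative helper, not code from KoboldCpp or llama.cpp) that expands the first shard name into the full list:

```python
# Sketch: expanding a gguf-split shard name like
# "...-00001-of-00002.gguf" into the full list of shard names.
# Illustrative only; not code from KoboldCpp or llama.cpp.
import re

SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<idx>\d{5})-of-(?P<total>\d{5})\.gguf$")

def expand_shards(first_shard):
    m = SHARD_RE.match(first_shard)
    if m is None:
        return [first_shard]  # plain single-file model
    total = int(m.group("total"))
    return [f"{m.group('prefix')}-{i:05d}-of-{total:05d}.gguf"
            for i in range(1, total + 1)]
```

This also shows why the .001/.002 renaming broke things: the shard index and total are encoded in the name itself, so the loader never sees the expected pattern.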
I can confirm they work fine; the change of format for split files confused everyone familiar with the old method. I had to rewrite my docker to support the new method, but now it works on Runpod. And locally, if it's these 00001-of files, you indeed have to either just load them or merge them with gguf-split instead of the old methods.
https://huggingface.co/dranger003/c4ai-command-r-plus-iMat.GGUF
Sorry if this is the wrong place, but does anybody know which files are preferable: those or https://huggingface.co/pmysl/c4ai-command-r-plus-GGUF ? I've downloaded the Q5_K_M pair from the latter and it seems to work fine in 1.62.2 (without joining the splits)
Actually, no. I get strange output (off the rails, not finishing sentences, just adding less and less sensical words) after a few turns, even with temp 0 (or top_k 1). Tried both
command-r-plus-Q5_K_M-00001-of-00002.gguf
ggml-c4ai-command-r-plus-104b-q5_k_s-00001-of-00002.gguf
(With their second parts stored to the same folder)
Do I have to specify --ropeconfig for CommandR+ explicitly?
If you are exceeding their normal context size, yes.
So, what are its parameters? And what's its nominal context size? I'm kind of confused, since for GGUF models this has always worked automatically, relying on the model's internal metadata.
Seems to be about 8k. I don't see it defined in the gguf metadata though.
Keep in mind this model is allergic to repetition penalty. Turn it way down, to something like 1.01. That isn't a bug; it just happens on some tunes, especially larger models. Seen it before on NeoX and some finetunes too.
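If you drive KoboldCpp over its HTTP API rather than the UI, the penalty can be set per request. A rough sketch, assuming the "rep_pen" field of the Kobold generate API and the default port from the launch settings shown earlier (untested against a live server here):

```python
# Sketch: requesting generation from a running KoboldCpp instance with a
# near-disabled repetition penalty, as suggested above. "rep_pen" follows
# the Kobold generate API; localhost:5001 assumes default launch settings.
import json
import urllib.request

payload = {
    "prompt": "Write a short paragraph about the sea.",
    "max_length": 64,
    "rep_pen": 1.01,      # way down from the usual 1.1
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:   # requires a running KoboldCpp
#     print(json.load(resp)["results"][0]["text"])
```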
Oh! I had my usual 1.1; will definitely try disabling it, thanks.
Hmm, lowering max context to 8k seems to have fixed the issue, but disabling rep_pen looks even better! (Otherwise it seemed the model was shy to print "," or "." to finish its sentences.)
But why just 8k? I thought CommandR could handle very long contexts by default, like Mixtral?
I don't know, I'm just guessing based on the embed positions in the json.
CommandR and CommandR+ models have 128k context size, as specified here: https://huggingface.co/CohereForAI/c4ai-command-r-v01
I have this error on latest koboldcpp when loading CommandR. Same gguf works perfectly in text-generation-webui.
Welcome to KoboldCpp - Version 1.64 For command line arguments, please refer to --help
Loading model: D:\text-generation-webui\models\c4ai-command-r-v01.gguf [Threads: 17, BlasThreads: 17, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: command-r
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 32 key-value pairs and 322 tensors from D:\text-generation-webui\models\c4ai-command-r-v01.gguf
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'command-r'
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
File "koboldcpp.py", line 3332, in
They very recently switched the tokenization for this model, so your quant format is newer than the newest KoboldCpp release. Hold on to that file, since it will probably work better in our next release; feel free to link it so we can double-check that it works.
The one I just tested with does load but doesn't have the new tokenizer for it: https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF/resolve/main/c4ai-command-r-v01-Q4_K_S.gguf?download=true
Side note: you are using 1.64, which has known clip bugs. If you care about Llava functionality in your models, upgrade to 1.64.1.
Will there ever be a version with Command-R (https://huggingface.co/CohereForAI/c4ai-command-r-v01) support? llama.cpp has supported this model for a long time.