Is that a supported model architecture? Maybe try running it with CLBlast or OpenBLAS first. Does it happen with llama.cpp main.exe too?
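If you want to compare against upstream, the llama.cpp test would be something along these lines (a rough sketch from memory; adjust the path and layer count to your setup):

main.exe -m C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf -c 2048 -ngl 40 -p "Hello"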
I don't use llama.cpp, so I can't report on that. Anyhow, the model works with OpenBLAS and CLBlast. Below is the terminal output for all three modes; all of them show the vocab mismatch. CuBLAS's LF token is 30, while OpenBLAS and CLBlast have 128, if that is relevant.
CUBLAS
Welcome to KoboldCpp - Version 1.47.1
For command line arguments, please refer to --help
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
Namespace(bantokens=None, blasbatchsize=512, blasthreads=6, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=40, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf', multiuser=False, noavx2=False, noblas=False, nommap=False, onready='', port=5001, port_param=5001, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, tensor_split=None, threads=6, useclblast=None, usecublas=['normal', '0', 'mmq'], usemlock=True)
Loading model: C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf [Threads: 6, BlasThreads: 6, SmartContext: False]
Identified as LLAMA model: (ver 6) Attempting to Load...
Using automatic RoPE scaling (scale:1.000, base:32000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf (version unknown)
llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 213/152064 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 109170
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 14.17 B
llm_load_print_meta: model size = 9.95 GiB (6.03 BPW)
llm_load_print_meta: general.name = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 30 '?'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 1166.10 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/43 layers to GPU
llm_load_tensors: VRAM used: 9020.31 MB
...........................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 12800.00 MB
llama_new_context_with_model: compute buffer total size = 1348.13 MB
llama_new_context_with_model: VRAM scratch buffer: 1342.00 MB
llama_new_context_with_model: total VRAM used: 10362.32 MB (model: 9020.31 MB, context: 1342.00 MB)
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
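For reference, the Namespace above comes from roughly this launch command. This is my hand-reconstructed sketch, so the flag spellings may be slightly off from the actual --help output:

koboldcpp.exe --model C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf --usecublas normal 0 mmq --gpulayers 40 --contextsize 16384 --threads 6 --blasthreads 6 --usemlock --launch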
CLBLAS
Welcome to KoboldCpp - Version 1.47.1
For command line arguments, please refer to --help
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.dll
Namespace(bantokens=None, blasbatchsize=512, blasthreads=6, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=40, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf', multiuser=False, noavx2=False, noblas=False, nommap=False, onready='', port=5001, port_param=5001, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, tensor_split=None, threads=6, useclblast=[0, 0], usecublas=None, usemlock=True)
Loading model: C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf [Threads: 6, BlasThreads: 6, SmartContext: False]
Identified as LLAMA model: (ver 6) Attempting to Load...
Using automatic RoPE scaling (scale:1.000, base:32000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Platform:0 Device:0 - NVIDIA CUDA with NVIDIA GeForce RTX 3060
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 3060'
ggml_opencl: device FP16 support: false
CL FP16 temporarily disabled pending further optimization.
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf (version unknown)
llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 213/152064 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 109170
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 14.17 B
llm_load_print_meta: model size = 9.95 GiB (6.03 BPW)
llm_load_print_meta: general.name = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'テ・
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required = 1166.10 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/41 layers to GPU
llm_load_tensors: VRAM used: 9020.31 MB
...........................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 12800.00 MB
llama_new_context_with_model: compute buffer total size = 1348.13 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
OPENBLAS
Welcome to KoboldCpp - Version 1.47.1
For command line arguments, please refer to --help
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
Namespace(bantokens=None, blasbatchsize=512, blasthreads=6, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=40, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf', multiuser=False, noavx2=False, noblas=False, nommap=False, onready='', port=5001, port_param=5001, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, tensor_split=None, threads=6, useclblast=None, usecublas=None, usemlock=True)
Loading model: C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf [Threads: 6, BlasThreads: 6, SmartContext: False]
Identified as LLAMA model: (ver 6) Attempting to Load...
Using automatic RoPE scaling (scale:1.000, base:32000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf (version unknown)
llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 213/152064 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 109170
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 14.17 B
llm_load_print_meta: model size = 9.95 GiB (6.03 BPW)
llm_load_print_meta: general.name = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'テ・
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 10186.41 MB
...........................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 12800.00 MB
llama_new_context_with_model: compute buffer total size = 1348.13 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
Okay, first: you're on version 1.47.1; try upgrading to version 1.47.2 just in case.
Next, the CUDA kernel failed during a matrix multiplication. You are using a rather large context size of 16k; perhaps try reducing that to 2k first and see if CUDA works then. You may also want to lower a few other parameters.
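A minimal retest would look something like this (a sketch based on your settings above, with mmq dropped and the context shrunk; adjust as needed):

koboldcpp.exe --model C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf --usecublas normal 0 --gpulayers 40 --contextsize 2048 --threads 6 --launch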
Still doesn't work with MMQ disabled and a 2k context.
Please try v1.48, which should have fixed this issue.
CausalLM worked without issue. Good job.
When I tried to run CausalLM 14b, I got the following error message. There is a 7b version of CausalLM, if you need something a little smaller to examine.