LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[BUG] (v1.47) -> CausalLM 14b, CUDA error 9 #490

Closed SabinStargem closed 12 months ago

SabinStargem commented 1 year ago

When I tried to run CausalLM 14b, I got the following error message. There is a 7b version of CausalLM, if you need something a little smaller to examine.

Processing Prompt [BLAS] (4808 / 4808 tokens)
Generating (1 / 1024 tokens)
CUDA error 9 at C:\Users\HCH\Desktop\koboldcpp\ggml-cuda.cu:6858: invalid configuration argument
current device: 0

[process exited with code 1 (0x00000001)]
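As background, CUDA runtime error 9 is cudaErrorInvalidConfiguration, the condition printed above as "invalid configuration argument": a kernel was launched with grid/block dimensions or resources the device cannot satisfy, which very large batches or contexts can trigger. A minimal, purely illustrative lookup in Python (not koboldcpp code):

# Illustrative sketch, not from koboldcpp: decode the numeric CUDA runtime
# error code seen in the log. "invalid configuration argument" is the
# cudaGetErrorString text for cudaErrorInvalidConfiguration.
CUDA_RUNTIME_ERRORS = {
    0: "cudaSuccess",
    9: "cudaErrorInvalidConfiguration",  # invalid configuration argument
}

def describe_cuda_error(code: int) -> str:
    # Fall back to a generic label for codes not listed here.
    return CUDA_RUNTIME_ERRORS.get(code, f"CUDA error {code}")

print(describe_cuda_error(9))  # -> cudaErrorInvalidConfiguration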

LostRuins commented 1 year ago

Is that a supported model architecture? Maybe try running it with CLBlast or OpenBLAS first. Does it happen with llama.cpp's main.exe too?
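If it helps to isolate the backend, here is a rough sketch of relaunching the same model with each option (Python, using only flags that appear in the logs below; treat it as illustrative, not an exact command):

# Illustrative sketch: relaunch koboldcpp with each BLAS backend in turn to
# see whether the failure is specific to CuBLAS. Flags mirror the Namespace
# output in the logs below; close each instance before the next launch.
import subprocess

MODEL = r"C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf"
COMMON = ["koboldcpp.exe", "--model", MODEL,
          "--contextsize", "16384", "--gpulayers", "40", "--threads", "6"]

BACKENDS = {
    "cublas": ["--usecublas", "normal", "0", "mmq"],
    "clblast": ["--useclblast", "0", "0"],
    "openblas": [],  # default CPU/OpenBLAS path, no GPU backend flag
}

for name, extra in BACKENDS.items():
    print(f"--- launching with {name} ---")
    subprocess.run(COMMON + extra, check=False)  # blocks until that server exits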

SabinStargem commented 1 year ago

I don't use llama.cpp, so I can't report on that. Anyhow, the model works with OpenBLAS and CLBlast. Below is the terminal output for each mode. All three modes report a vocab mismatch. CuBLAS's LF token is 30, while OpenBLAS and CLBlast have 128, if that is relevant.

CUBLAS


Welcome to KoboldCpp - Version 1.47.1 For command line arguments, please refer to --help


Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required. Initializing dynamic library: koboldcpp_cublas.dll

Namespace(bantokens=None, blasbatchsize=512, blasthreads=6, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=40, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf', multiuser=False, noavx2=False, noblas=False, nommap=False, onready='', port=5001, port_param=5001, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, tensor_split=None, threads=6, useclblast=None, usecublas=['normal', '0', 'mmq'], usemlock=True)

Loading model: C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf [Threads: 6, BlasThreads: 6, SmartContext: False]


Identified as LLAMA model: (ver 6) Attempting to Load...

Using automatic RoPE scaling (scale:1.000, base:32000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf (version unknown)
llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 213/152064 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 109170
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 14.17 B
llm_load_print_meta: model size = 9.95 GiB (6.03 BPW)
llm_load_print_meta: general.name = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 30 '?'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 1166.10 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/43 layers to GPU
llm_load_tensors: VRAM used: 9020.31 MB
...........................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 12800.00 MB
llama_new_context_with_model: compute buffer total size = 1348.13 MB
llama_new_context_with_model: VRAM scratch buffer: 1342.00 MB
llama_new_context_with_model: total VRAM used: 10362.32 MB (model: 9020.31 MB, context: 1342.00 MB)
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

CLBLAS


Welcome to KoboldCpp - Version 1.47.1 For command line arguments, please refer to --help


Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required. Initializing dynamic library: koboldcpp_clblast.dll

Namespace(bantokens=None, blasbatchsize=512, blasthreads=6, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=40, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf', multiuser=False, noavx2=False, noblas=False, nommap=False, onready='', port=5001, port_param=5001, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, tensor_split=None, threads=6, useclblast=[0, 0], usecublas=None, usemlock=True)

Loading model: C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf [Threads: 6, BlasThreads: 6, SmartContext: False]


Identified as LLAMA model: (ver 6) Attempting to Load...

Using automatic RoPE scaling (scale:1.000, base:32000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

Platform:0 Device:0 - NVIDIA CUDA with NVIDIA GeForce RTX 3060

ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 3060'
ggml_opencl: device FP16 support: false
CL FP16 temporarily disabled pending further optimization.
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf (version unknown)
llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 213/152064 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 109170
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 14.17 B
llm_load_print_meta: model size = 9.95 GiB (6.03 BPW)
llm_load_print_meta: general.name = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'テ・
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required = 1166.10 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/41 layers to GPU
llm_load_tensors: VRAM used: 9020.31 MB
...........................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 12800.00 MB
llama_new_context_with_model: compute buffer total size = 1348.13 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

OPENBLAS


Welcome to KoboldCpp - Version 1.47.1 For command line arguments, please refer to --help


Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required. Initializing dynamic library: koboldcpp_openblas.dll

Namespace(bantokens=None, blasbatchsize=512, blasthreads=6, config=None, contextsize=16384, debugmode=0, forceversion=0, foreground=False, gpulayers=40, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf', multiuser=False, noavx2=False, noblas=False, nommap=False, onready='', port=5001, port_param=5001, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, tensor_split=None, threads=6, useclblast=None, usecublas=None, usemlock=True)

Loading model: C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf [Threads: 6, BlasThreads: 6, SmartContext: False]


Identified as LLAMA model: (ver 6) Attempting to Load...

Using automatic RoPE scaling (scale:1.000, base:32000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from C:\KoboldCPP\Models\causallm_14b.Q5_1.gguf (version unknown)
llm_load_vocab: mismatch in special tokens definition ( 421/152064 vs 213/152064 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 109170
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13696
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 14.17 B
llm_load_print_meta: model size = 9.95 GiB (6.03 BPW)
llm_load_print_meta: general.name = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'テ・
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 10186.41 MB
...........................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 12800.00 MB
llama_new_context_with_model: compute buffer total size = 1348.13 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

LostRuins commented 1 year ago

Okay, first: you're on Version 1.47.1, so try upgrading to Version 1.47.2 just in case.

Next, the CUDA kernel failed during a matrix multiplication. You are using a rather large context size of 16k; perhaps try reducing that to 2k first and see if it still happens with CUDA. You may want to lower a few other parameters too.
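For reference, the reduced retest could look something like this (again just a sketch; the only changes from the original launch are the smaller context and dropping mmq):

# Illustrative sketch of the suggested retest: same model on CuBLAS, but with
# a 2k context and without the mmq kernels, to check whether the
# launch-configuration error is tied to the large context/batch.
import subprocess

subprocess.run([
    "koboldcpp.exe",
    "--model", r"C:/KoboldCPP/Models/causallm_14b.Q5_1.gguf",
    "--contextsize", "2048",        # down from 16384
    "--gpulayers", "40",
    "--threads", "6",
    "--usecublas", "normal", "0",   # no "mmq": use the regular cuBLAS path
])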

SabinStargem commented 1 year ago

Still doesn't work with MM disabled and 2k context.

LostRuins commented 12 months ago

Please try v1.48, which should have fixed this issue.

SabinStargem commented 12 months ago

CausalLM worked without issue. Good job.