[BUG] (v1.55.1) - MULTI GPU - ggml-cuda.cu:229: !"CUDA error"

LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.

GNU Affero General Public License v3.0

5.2k stars 357 forks source link

*** Welcome to KoboldCpp - Version 1.55.1 For command line arguments, please refer to --help *** Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required. Initializing dynamic library: koboldcpp_cublas.dll ========== Namespace(bantokens=None, blasbatchsize=512, blasthreads=31, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=14, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/dolphin-2.7-mixtral-8x7b.Q6_K.gguf', multiuser=1, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=31, useclblast=None, usecublas=['normal', 'mmq'], usemlock=True) ========== Loading model: C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Q6_K.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True] The reported GGUF Arch is: llama --- Identified as LLAMA model: (ver 6) Attempting to Load... --- Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead! System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | ggml_init_cublas: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Q'oﾋｺ(llm_load_vocab: special tokens definition check successful ( 261/32002 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32002 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 8 llm_load_print_meta: n_expert_used = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = unknown, may not work llm_load_print_meta: model params = 46.70 B llm_load_print_meta: model size = 35.74 GiB (6.57 BPW) llm_load_print_meta: general.name = cognitivecomputations_dolphin-2.7-mixtral-8x7b llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 32000 '<|im_end|>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.38 MiB llm_load_tensors: using CUDA for GPU acceleration llm_load_tensors: system memory used = 20677.67 MiB llm_load_tensors: VRAM used = 15922.81 MiB llm_load_tensors: offloading 14 repeating layers to GPU llm_load_tensors: offloaded 14/33 layers to GPU .................................................................................................... Automatic RoPE Scaling: Using model internal value. llama_new_context_with_model: n_ctx = 32848 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: VRAM kv self = 1796.38 MB llama_new_context_with_model: KV self size = 4106.00 MiB, K (f16): 2053.00 MiB, V (f16): 2053.00 MiB llama_build_graph: non-view tensors processed: 1124/1124 llama_new_context_with_model: compute buffer total size = 2172.38 MiB llama_new_context_with_model: VRAM scratch buffer: 2169.19 MiB llama_new_context_with_model: total VRAM used: 19888.38 MiB (model: 15922.81 MiB, context: 3965.57 MiB) CUDA error: out of memory current device: 0, in function ggml_cuda_assign_scratch_offset at d:\a\koboldcpp\koboldcpp\ggml-cuda.cu:9848 cudaMalloc(&g_scratch_buffer, g_scratch_size) GGML_ASSERT: d:\a\koboldcpp\koboldcpp\ggml-cuda.cu:229: !"CUDA error" [process exited with code 3221226505 (0xc0000409)]

Both of the cards are unused by the system - I am not gaming nor actively watching videos.

That is why I am confused why an equal split isn't working. Ditto for 4/6 with fewer layers, in case the KV cache or whatnot can only go onto the first card. I get better thresholds when Low VRAM mode is enabled, but ultimately I still OOM around 20gb out of the 24gb that should be available.

...Hold up. In the task manager, GPU 0 has 9.7gb used. However, GPU 1 is at 11.3. I have my split at 5/5. When KoboldCPP is shut down, GPU 0 rests at 0.5gb.

Welcome to KoboldCpp - Version 1.55.1 For command line arguments, please refer to --help

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required. Initializing dynamic library: koboldcpp_cublas.dll

Namespace(bantokens=None, blasbatchsize=512, blasthreads=31, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=15, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/dolphin-2.7-mixtral-8x7b.Q6_K.gguf', multiuser=1, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=[5.0, 5.0], threads=31, useclblast=None, usecublas=['lowvram', 'mmq'], usemlock=True)

Loading model: C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Q6_K.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

Identified as LLAMA model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead! System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | ggml_init_cublas: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Qｭ+R・llm_load_vocab: special tokens definition check successful ( 261/32002 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32002 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 8 llm_load_print_meta: n_expert_used = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = unknown, may not work llm_load_print_meta: model params = 46.70 B llm_load_print_meta: model size = 35.74 GiB (6.57 BPW) llm_load_print_meta: general.name = cognitivecomputations_dolphin-2.7-mixtral-8x7b llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 32000 '<|im_end|>' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.38 MiB llm_load_tensors: using CUDA for GPU acceleration llm_load_tensors: system memory used = 19540.33 MiB llm_load_tensors: VRAM used = 17060.16 MiB llm_load_tensors: offloading 15 repeating layers to GPU llm_load_tensors: offloaded 15/33 layers to GPU .................................................................................................... Automatic RoPE Scaling: Using model internal value. llama_new_context_with_model: n_ctx = 32848 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_new_context_with_model: KV self size = 4106.00 MiB, K (f16): 2053.00 MiB, V (f16): 2053.00 MiB llama_build_graph: non-view tensors processed: 1124/1124 llama_new_context_with_model: compute buffer total size = 2172.38 MiB llama_new_context_with_model: VRAM scratch buffer: 2169.19 MiB llama_new_context_with_model: total VRAM used: 19229.35 MiB (model: 17060.16 MiB, context: 2169.19 MiB) Load Model OK: True Embedded Kobold Lite loaded. Starting Kobold API on port 5001 at http://localhost:5001/api/ Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

~~Please connect to custom endpoint at http://localhost:5001~~

Not sure if I can. I replaced one of the 3060s with a 4090, so I can't replicate my setup. It is difficult to nail down how much space the layers and KV cache requires for each card, so that I can deliver a legit report.

Here are my results for the failed bootup. I need to leave enough memory on the 3060 for my browser, gaming, and Windows, so I can only use about 8ish gigs for AI. I think there is enough memory left on both cards for the failed 30-layer bootup, but I am not sure. I reduced layer count until I got a successful boot, at 26 layers. I tried again with 38 layers, at 7-3 split in Low VRAM mode. It is much easier to understand the VRAM taken in that mode, and successfully booted.

FAILED BOOTUP

Welcome to KoboldCpp - Version 1.56 For command line arguments, please refer to --help

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required. Initializing dynamic library: koboldcpp_cublas.dll

Namespace(bantokens=None, blasbatchsize=512, blasthreads=31, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=30, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/bagel-hermes-2x34b.Q6_K.gguf', multiuser=1, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=[8.0, 2.0], threads=31, useclblast=None, usecublas=['normal', 'mmq'], usemlock=True, usevulkan=None)

Loading model: C:\KoboldCPP\Models\bagel-hermes-2x34b.Q6_K.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

Identified as GGUF model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead! System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | ggml_init_cublas: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes llama_model_loader: loaded meta data with 26 key-value pairs and 783 tensors from C:\KoboldCPP\Models\bagel-hermes-2x34b.Q6_K.ggﾎllm_load_vocab: mismatch in special tokens definition ( 498/64000 vs 267/64000 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 64000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 200000 llm_load_print_meta: n_embd = 7168 llm_load_print_meta: n_head = 56 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 60 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 7 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 20480 llm_load_print_meta: n_expert = 2 llm_load_print_meta: n_expert_used = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 5000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 200000 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 30B llm_load_print_meta: model ftype = unknown, may not work llm_load_print_meta: model params = 60.81 B llm_load_print_meta: model size = 46.46 GiB (6.56 BPW) llm_load_print_meta: general.name = weyaxi_bagel-hermes-2x34b llm_load_print_meta: BOS token = 1 '<|startoftext|>' llm_load_print_meta: EOS token = 2 '<|endoftext|>' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: PAD token = 1 '<|startoftext|>' llm_load_print_meta: LF token = 315 '<0x0A>' llm_load_tensors: ggml ctx size = 0.90 MiB llm_load_tensors: offloading 30 repeating layers to GPU llm_load_tensors: offloaded 30/61 layers to GPU llm_load_tensors: CPU buffer size = 47578.97 MiB llm_load_tensors: CUDA0 buffer size = 18744.47 MiB llm_load_tensors: CUDA1 buffer size = 4686.12 MiB .................................................................................................... Automatic RoPE Scaling: Using model internal value. llama_new_context_with_model: n_ctx = 32848 llama_new_context_with_model: freq_base = 5000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 3849.38 MiB llama_kv_cache_init: CUDA0 KV buffer size = 3079.50 MiB llama_kv_cache_init: CUDA1 KV buffer size = 769.88 MiB llama_new_context_with_model: KV self size = 7698.75 MiB, K (f16): 3849.38 MiB, V (f16): 3849.38 MiB llama_new_context_with_model: CUDA_Host input buffer size = 78.29 MiB ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4128.20 MiB on device 0: cudaMalloc failed: out of memory Traceback (most recent call last): File "koboldcpp.py", line 2580, in File "koboldcpp.py", line 2426, in main File "koboldcpp.py", line 328, in load_model OSError: exception: access violation reading 0x0000000000000010 [31412] Failed to execute script 'koboldcpp' due to unhandled exception!

[process exited with code 1 (0x00000001)]

LostRuins / koboldcpp

[BUG] (v1.55.1) - MULTI GPU - ggml-cuda.cu:229: !"CUDA error" #612

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required. Initializing dynamic library: koboldcpp_cublas.dll

Identified as LLAMA model: (ver 6) Attempting to Load...

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required. Initializing dynamic library: koboldcpp_cublas.dll

Identified as GGUF model: (ver 6) Attempting to Load...