SabinStargem opened this issue 9 months ago
Yeah, that's an out-of-memory error. When viewing in Task Manager, which card appears to use more VRAM? Try adjusting tensor_split to assign fewer layers to that card.
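For example (illustrative numbers only, not a tested configuration): if GPU 1 is the fuller card, something like
--usecublas lowvram mmq --tensor_split 6.0 4.0 --gpulayers 15
would push more of the offloaded layers onto GPU 0 and fewer onto GPU 1.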
Both of the cards are otherwise unused by the system - I am not gaming or actively watching videos. That is why I am confused that an equal split isn't working. Ditto for a 4/6 split with fewer layers, in case the KV cache or whatnot can only go onto the first card. I get better thresholds when Low VRAM mode is enabled, but ultimately I still OOM at around 20 GB of the 24 GB that should be available.
...Hold up. In Task Manager, GPU 0 has 9.7 GB used, but GPU 1 is at 11.3 GB, even though my split is 5/5. When KoboldCPP is shut down, GPU 0 rests at 0.5 GB.
Welcome to KoboldCpp - Version 1.55.1
For command line arguments, please refer to --help
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
Namespace(bantokens=None, blasbatchsize=512, blasthreads=31, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=15, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/dolphin-2.7-mixtral-8x7b.Q6_K.gguf', multiuser=1, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=[5.0, 5.0], threads=31, useclblast=None, usecublas=['lowvram', 'mmq'], usemlock=True)
Loading model: C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Q6_K.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: llama
Identified as LLAMA model: (ver 6) Attempting to Load...
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Q6_K.gguf
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 8
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 35.74 GiB (6.57 BPW)
llm_load_print_meta: general.name = cognitivecomputations_dolphin-2.7-mixtral-8x7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.38 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used = 19540.33 MiB
llm_load_tensors: VRAM used = 17060.16 MiB
llm_load_tensors: offloading 15 repeating layers to GPU
llm_load_tensors: offloaded 15/33 layers to GPU
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 32848
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 4106.00 MiB, K (f16): 2053.00 MiB, V (f16): 2053.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 2172.38 MiB
llama_new_context_with_model: VRAM scratch buffer: 2169.19 MiB
llama_new_context_with_model: total VRAM used: 19229.35 MiB (model: 17060.16 MiB, context: 2169.19 MiB)
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Please connect to custom endpoint at http://localhost:5001
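A rough back-of-the-envelope from the log above (just a sketch: it assumes the 5/5 split divides the logged VRAM total evenly across the two 3060s, and it ignores the per-GPU driver/CUDA overhead that Task Manager also counts):
# Rough per-card budget implied by the log above (all values in MiB); a sketch only.
total_vram = 19229.35              # "total VRAM used" from llama_new_context_with_model
per_card = total_vram / 2          # assume the 5.0/5.0 tensor_split lands evenly
capacity = 12 * 1024               # each RTX 3060 has 12 GB
print(f"per card ~{per_card:.0f} MiB, headroom ~{capacity - per_card:.0f} MiB")
On paper that leaves a couple of GB of headroom per card, so the 9.7 GB and 11.3 GB figures in Task Manager suggest there is additional per-GPU overhead on top of what the loader reports.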
Can you check if this is now fixed in 1.56? The backend was reworked.
Not sure if I can. I replaced one of the 3060s with a 4090, so I can't replicate my old setup. It is difficult to nail down how much space the layers and the KV cache require on each card, which makes it hard to deliver a legit report.
Here are my results for the failed bootup. I need to leave enough memory on the 3060 for my browser, gaming, and Windows, so I can only use about 8-ish GB of it for AI. I think there was enough memory left on both cards for the failed 30-layer bootup, but I am not sure. I reduced the layer count until I got a successful boot, at 26 layers. I then tried again with 38 layers at a 7-3 split in Low VRAM mode; the VRAM usage is much easier to follow in that mode, and it booted successfully.
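A command along these lines would match that successful 38-layer run (a reconstruction for illustration, not the exact flags used):
--usecublas lowvram mmq --tensor_split 7.0 3.0 --gpulayers 38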
FAILED BOOTUP
Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
Namespace(bantokens=None, blasbatchsize=512, blasthreads=31, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=30, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/bagel-hermes-2x34b.Q6_K.gguf', multiuser=1, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=[8.0, 2.0], threads=31, useclblast=None, usecublas=['normal', 'mmq'], usemlock=True, usevulkan=None)
Loading model: C:\KoboldCPP\Models\bagel-hermes-2x34b.Q6_K.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]
The reported GGUF Arch is: llama
Identified as GGUF model: (ver 6) Attempting to Load...
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 26 key-value pairs and 783 tensors from C:\KoboldCPP\Models\bagel-hermes-2x34b.Q6_K.gguf
llm_load_vocab: mismatch in special tokens definition ( 498/64000 vs 267/64000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 64000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 200000
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_head = 56
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 20480
llm_load_print_meta: n_expert = 2
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 200000
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 60.81 B
llm_load_print_meta: model size = 46.46 GiB (6.56 BPW)
llm_load_print_meta: general.name = weyaxi_bagel-hermes-2x34b
llm_load_print_meta: BOS token = 1 '<|startoftext|>'
llm_load_print_meta: EOS token = 2 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 1 '<|startoftext|>'
llm_load_print_meta: LF token = 315 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.90 MiB
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/61 layers to GPU
llm_load_tensors: CPU buffer size = 47578.97 MiB
llm_load_tensors: CUDA0 buffer size = 18744.47 MiB
llm_load_tensors: CUDA1 buffer size = 4686.12 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 32848
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 3849.38 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 3079.50 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 769.88 MiB
llama_new_context_with_model: KV self size = 7698.75 MiB, K (f16): 3849.38 MiB, V (f16): 3849.38 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 78.29 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4128.20 MiB on device 0: cudaMalloc failed: out of memory
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: exception: access violation reading 0x0000000000000010
[31412] Failed to execute script 'koboldcpp' due to unhandled exception!
[process exited with code 1 (0x00000001)]
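Adding up the device-0 numbers from that log shows why the allocation fails (a rough tally only; it ignores the CUDA context and anything else already resident on the card):
# Device-0 (RTX 4090, 24 GB) allocations copied from the failed-boot log, in MiB.
weights = 18744.47   # llm_load_tensors: CUDA0 buffer size
kv      = 3079.50    # llama_kv_cache_init: CUDA0 KV buffer size
compute = 4128.20    # the compute buffer whose cudaMalloc failed
needed = weights + kv + compute
print(f"requested ~{needed:.0f} MiB vs {24 * 1024} MiB on the card")   # ~25952 vs 24576
So at 30 layers with an 8:2 split, device 0 is asked for roughly 25.3 GiB before any driver overhead, which does not fit in 24 GB; either fewer layers or a different split would be needed for that boot to succeed.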
Okay, one more thing you can try is swapping the main GPU.
When you use the --usecublas flag, its arguments are [lowvram|normal] [main GPU ID] [mmq].
So to use the second GPU as the main one instead, with an 80:20 split, use
--usecublas normal 1 mmq --tensor_split 8.0 2.0
Some people have reported that using a different main GPU helps.
I was using a pair of 3060 12 GB cards and got the error below. With the settings I had, about 19 GB would be taken in VRAM, with the remaining 20 GB or so in system RAM. Using a single card with 7 layers offloaded, I booted successfully.
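For what it's worth, a single-card launch along these lines matches that description (an illustration only; how the launcher maps a single-GPU selection to flags may differ):
--usecublas normal 0 mmq --gpulayers 7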