LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[BUG] (v1.55.1) - MULTI GPU - ggml-cuda.cu:229: !"CUDA error" #612

Open SabinStargem opened 9 months ago

SabinStargem commented 9 months ago

I was using a pair of 3060 12GB cards and got the error below. With the settings I had, about 19GB would be taken as VRAM, with the remaining 20GB in system RAM. Using a single card with 7 layers offloaded, I booted successfully.

***
Welcome to KoboldCpp - Version 1.55.1
For command line arguments, please refer to --help
***
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(bantokens=None, blasbatchsize=512, blasthreads=31, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=14, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/dolphin-2.7-mixtral-8x7b.Q6_K.gguf', multiuser=1, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=31, useclblast=None, usecublas=['normal', 'mmq'], usemlock=True)
==========
Loading model: C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Q6_K.gguf
[Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Q6_K.gguf
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 8
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 35.74 GiB (6.57 BPW)
llm_load_print_meta: general.name     = cognitivecomputations_dolphin-2.7-mixtral-8x7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.38 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  = 20677.67 MiB
llm_load_tensors: VRAM used           = 15922.81 MiB
llm_load_tensors: offloading 14 repeating layers to GPU
llm_load_tensors: offloaded 14/33 layers to GPU
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 32848
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 1796.38 MB
llama_new_context_with_model: KV self size  = 4106.00 MiB, K (f16): 2053.00 MiB, V (f16): 2053.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 2172.38 MiB
llama_new_context_with_model: VRAM scratch buffer: 2169.19 MiB
llama_new_context_with_model: total VRAM used: 19888.38 MiB (model: 15922.81 MiB, context: 3965.57 MiB)
CUDA error: out of memory
  current device: 0, in function ggml_cuda_assign_scratch_offset at d:\a\koboldcpp\koboldcpp\ggml-cuda.cu:9848
  cudaMalloc(&g_scratch_buffer, g_scratch_size)
GGML_ASSERT: d:\a\koboldcpp\koboldcpp\ggml-cuda.cu:229: !"CUDA error"

[process exited with code 3221226505 (0xc0000409)]
LostRuins commented 9 months ago

Yeah, that's an out-of-memory error. When viewing in Task Manager, which card appears to use more VRAM? Try adjusting tensor_split to assign fewer layers to that card.
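As an illustration only (the exact values would need tuning for this setup), a split that favors GPU 0 can be passed alongside the existing CuBLAS options on the command line, for example:

koboldcpp.exe --usecublas normal mmq --gpulayers 14 --contextsize 32768 --tensor_split 6.0 4.0 --model C:/KoboldCPP/Models/dolphin-2.7-mixtral-8x7b.Q6_K.gguf

The two tensor_split values are relative proportions rather than layer counts, so 6.0 4.0 places roughly 60% of the offloaded layers on the first GPU and 40% on the second.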

SabinStargem commented 9 months ago

Both cards are otherwise unused by the system - I am not gaming or actively watching videos.

That is why I am confused that an equal split isn't working. The same goes for a 4/6 split with fewer layers, in case the KV cache or whatnot can only go onto the first card. I get better thresholds when Low VRAM mode is enabled, but ultimately I still OOM at around 20GB of the 24GB that should be available.

...Hold up. In Task Manager, GPU 0 has 9.7GB used, but GPU 1 is at 11.3GB, even though I have my split at 5/5. When KoboldCPP is shut down, GPU 0 rests at 0.5GB.


Welcome to KoboldCpp - Version 1.55.1
For command line arguments, please refer to --help


Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(bantokens=None, blasbatchsize=512, blasthreads=31, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=15, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/dolphin-2.7-mixtral-8x7b.Q6_K.gguf', multiuser=1, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=[5.0, 5.0], threads=31, useclblast=None, usecublas=['lowvram', 'mmq'], usemlock=True)

Loading model: C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Q6_K.gguf
[Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama


Identified as LLAMA model: (ver 6)
Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from C:\KoboldCPP\Models\dolphin-2.7-mixtral-8x7b.Q6_K.gguf
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 8
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 46.70 B
llm_load_print_meta: model size = 35.74 GiB (6.57 BPW)
llm_load_print_meta: general.name = cognitivecomputations_dolphin-2.7-mixtral-8x7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.38 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used = 19540.33 MiB
llm_load_tensors: VRAM used = 17060.16 MiB
llm_load_tensors: offloading 15 repeating layers to GPU
llm_load_tensors: offloaded 15/33 layers to GPU
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 32848
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 4106.00 MiB, K (f16): 2053.00 MiB, V (f16): 2053.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 2172.38 MiB
llama_new_context_with_model: VRAM scratch buffer: 2169.19 MiB
llama_new_context_with_model: total VRAM used: 19229.35 MiB (model: 17060.16 MiB, context: 2169.19 MiB)
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Please connect to custom endpoint at http://localhost:5001

LostRuins commented 9 months ago

Can you check if this is now fixed in 1.56? The backend was reworked.

SabinStargem commented 9 months ago

Not sure if I can. I replaced one of the 3060s with a 4090, so I can't replicate my original setup. It is difficult to nail down how much space the layers and KV cache require on each card, which makes it hard to deliver a solid report.

Here are my results for the failed bootup. I need to leave enough memory on the 3060 for my browser, gaming, and Windows, so I can only use about 8GB of it for AI. I think there was enough memory left on both cards for the failed 30-layer bootup, but I am not sure. I reduced the layer count until I got a successful boot at 26 layers. I then tried again with 38 layers at a 7/3 split in Low VRAM mode, which makes it much easier to understand how much VRAM is being taken, and that booted successfully.
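Rough back-of-the-envelope math, going by the failed log below: the model is 46.46 GiB across 60 repeating layers, so each offloaded layer costs roughly 0.77 GiB of VRAM before the KV cache and compute buffers. At 30 layers with an 8/2 split, that is about 18.3 GiB of weights on the 4090 (the log reports a CUDA0 buffer of 18744.47 MiB), plus about 3 GiB of KV cache and the 4128.20 MiB compute allocation that fails - roughly 25-26 GiB on device 0, which is already more than the 4090's 24GB even before Windows and the browser take their share.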

FAILED BOOTUP


Welcome to KoboldCpp - Version 1.56
For command line arguments, please refer to --help


Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(bantokens=None, blasbatchsize=512, blasthreads=31, config=None, contextsize=32768, debugmode=0, forceversion=0, foreground=False, gpulayers=30, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/bagel-hermes-2x34b.Q6_K.gguf', multiuser=1, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=[8.0, 2.0], threads=31, useclblast=None, usecublas=['normal', 'mmq'], usemlock=True, usevulkan=None)

Loading model: C:\KoboldCPP\Models\bagel-hermes-2x34b.Q6_K.gguf
[Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama


Identified as GGUF model: (ver 6)
Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 26 key-value pairs and 783 tensors from C:\KoboldCPP\Models\bagel-hermes-2x34b.Q6_K.gguf
llm_load_vocab: mismatch in special tokens definition ( 498/64000 vs 267/64000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 64000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 200000
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_head = 56
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 20480
llm_load_print_meta: n_expert = 2
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 200000
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 60.81 B
llm_load_print_meta: model size = 46.46 GiB (6.56 BPW)
llm_load_print_meta: general.name = weyaxi_bagel-hermes-2x34b
llm_load_print_meta: BOS token = 1 '<|startoftext|>'
llm_load_print_meta: EOS token = 2 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 1 '<|startoftext|>'
llm_load_print_meta: LF token = 315 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.90 MiB
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/61 layers to GPU
llm_load_tensors: CPU buffer size = 47578.97 MiB
llm_load_tensors: CUDA0 buffer size = 18744.47 MiB
llm_load_tensors: CUDA1 buffer size = 4686.12 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 32848
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 3849.38 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 3079.50 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 769.88 MiB
llama_new_context_with_model: KV self size = 7698.75 MiB, K (f16): 3849.38 MiB, V (f16): 3849.38 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 78.29 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4128.20 MiB on device 0: cudaMalloc failed: out of memory
Traceback (most recent call last):
  File "koboldcpp.py", line 2580, in <module>
  File "koboldcpp.py", line 2426, in main
  File "koboldcpp.py", line 328, in load_model
OSError: exception: access violation reading 0x0000000000000010
[31412] Failed to execute script 'koboldcpp' due to unhandled exception!

[process exited with code 1 (0x00000001)]

LostRuins commented 9 months ago

Okay, one more thing you can try is swapping the main GPU.

When you use the --usecublas flag, the arguments for it are [lowvram|normal] [main GPU ID] [mmq]. So to use the second GPU as the main one instead, with an 80:20 split, use

--usecublas normal 1 mmq --tensor_split 8.0 2.0

Some people have reported that using a different main GPU helps.