LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[Issue] Loading any model on 1.32 crashes saying I'm out of VRAM #255

Closed bartowski1182 closed 1 year ago

bartowski1182 commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

A 7B q2_K model should easily fit all of its layers into my 3060 with 12GB of VRAM

Current Behavior

Even loading the 7B q2_K model with only 20 layers offloaded results in an error saying I'm out of VRAM

Environment and Context

Ryzen 3600 with 64GB RAM; RTX 3060 with 12GB VRAM

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Load with parameters --model vicuna-7b-v1.3.ggmlv3.q2_K.bin --threads 1 --useclblast 0 0 --gpulayers 20
  2. Check the logs as it attempts to load and find the warning that I may be out of VRAM

Failure Logs

koboldcpp    | llama.cpp: loading model from /app/models/vicuna-7b-v1.3.ggmlv3.q2_K.bin
koboldcpp    | llama_model_load_internal: format     = ggjt v3 (latest)
koboldcpp    | llama_model_load_internal: n_vocab    = 32000
koboldcpp    | llama_model_load_internal: n_ctx      = 2048
koboldcpp    | llama_model_load_internal: n_embd     = 4096
koboldcpp    | llama_model_load_internal: n_mult     = 256
koboldcpp    | llama_model_load_internal: n_head     = 32
koboldcpp    | llama_model_load_internal: n_layer    = 32
koboldcpp    | llama_model_load_internal: n_rot      = 128
koboldcpp    | llama_model_load_internal: ftype      = 10 (mostly Q2_K)
koboldcpp    | llama_model_load_internal: n_ff       = 11008
koboldcpp    | llama_model_load_internal: n_parts    = 1
koboldcpp    | llama_model_load_internal: model size = 7B
koboldcpp    | llama_model_load_internal: ggml ctx size =    0.07 MB
koboldcpp    | ggml_opencl: clGetPlatformIDs(NPLAT, platform_ids, &n_platforms) error -1001 at ggml-opencl.cpp:787
koboldcpp    | You may be out of VRAM. Please check if you have enough.
koboldcpp    |
koboldcpp    | ---
koboldcpp    | Identified as LLAMA model: (ver 5)
koboldcpp    | Attempting to Load...
koboldcpp    | ---
koboldcpp    | System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
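The telling line in the log is the clGetPlatformIDs error -1001, which corresponds to CL_PLATFORM_NOT_FOUND_KHR from the OpenCL ICD loader: no OpenCL platform was found at all, so the generic "out of VRAM" message is misleading here. A minimal standalone check is sketched below (an assumption-laden example, not part of koboldcpp; it assumes the OpenCL headers and ICD loader are installed, e.g. via the ocl-icd-opencl-dev package on Ubuntu):

```cpp
// Sketch: enumerate OpenCL platforms the way a CLBlast backend must do
// before picking a device. If the ICD loader finds no platforms (missing
// GPU driver, or a container without GPU access), clGetPlatformIDs
// returns CL_PLATFORM_NOT_FOUND_KHR (-1001) rather than a memory error.
// Build (assumed command): g++ cl_check.cpp -lOpenCL -o cl_check
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint n_platforms = 0;
    cl_int err = clGetPlatformIDs(0, nullptr, &n_platforms);
    if (err != CL_SUCCESS || n_platforms == 0) {
        std::printf("clGetPlatformIDs failed (err=%d): no OpenCL platform visible\n", err);
        return 1;
    }

    std::vector<cl_platform_id> platforms(n_platforms);
    clGetPlatformIDs(n_platforms, platforms.data(), nullptr);

    for (cl_uint i = 0; i < n_platforms; ++i) {
        char name[256] = {0};
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, nullptr);
        std::printf("platform %u: %s\n", i, name);
    }
    return 0;
}
```

If a check like this lists no platforms inside the container but works on the host, the runtime simply cannot see a GPU driver, which is consistent with the eventual resolution below (wrong image).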
LostRuins commented 1 year ago

Are you running the exe? Building from source (did you run make clean)? Windows or Linux?

bartowski1182 commented 1 year ago

Good questions: building from source with a clean make; tried both make LLAMA_CUBLAS=1 and make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1; Ubuntu 22.04.

It's not even attempting to load anything onto the GPU; if I watch it with nvidia-smi, usage stays at 1 MB.

bartowski1182 commented 1 year ago

Sigh, ignore me, I was using the wrong image... spent a solid 30 minutes debugging a typo, love it.