LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Super long startup times? #592

Open cspenn opened 11 months ago

cspenn commented 11 months ago

Since 1.52, Kobold seems to take substantially longer to start up - on the order of 10x the previous startup times.

MacOS Sonoma, currently on KoboldCpp 1.53. Here's what it shows at startup:

Welcome to KoboldCpp - Version 1.53
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
==========
Namespace(model=None, model_param='/Volumes/SSD/models/dolphin_2_6_mixtral_8x7b_q5_k_m.gguf', port=5001, port_param=6969, host='', launch=False, lora=None, config=None, threads=4, blasthreads=4, highpriority=False, contextsize=32768, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=None, gpulayers=128, tensor_split=None, onready='', multiuser=0, remotetunnel=False, foreground=False, preloadstory='', quiet=False, ssl=None)
==========
Loading model: /Volumes/SSD/models/dolphin_2_6_mixtral_8x7b_q5_k_m.gguf 
[Threads: 4, BlasThreads: 4, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as LLAMA model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
llama_model_loader: loaded meta data with 25 key-value pairs and 995 tensors from /Volumes/SSD/models/dolphin_2_6_mixtral_8x7b_q5_k_m.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 8
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 30.02 GiB (5.52 BPW) 
llm_load_print_meta: general.name     = cognitivecomputations_dolphin-2.6-mixtral-8x7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.38 MiB
llm_load_tensors: system memory used  = 30735.88 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 32848
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  = 4106.00 MiB, K (f16): 2053.00 MiB, V (f16): 2053.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 2172.38 MiB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 6969 at http://localhost:6969/api/

Here's the launch command, issued through a zsh script:

if [[ -n $selected_model ]]; then
  python3 koboldcpp.py "$selected_model" 6969 --gpulayers 128 --contextsize "$context_size"
fi

How would I troubleshoot what's changed and why it takes so long to start up now?

LostRuins commented 11 months ago

Everything seems fine, although you're probably not using Metal (I think; did you build with LLAMA_METAL=1?). That said, Accelerate is pretty fast too.
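If you want to confirm Metal is compiled in, something like this should work (a rough sketch, assuming the Makefile flag from the project README; the model path and flags are just the ones from your log):

# Rebuild with Metal support enabled
cd koboldcpp
make clean
make LLAMA_METAL=1

# If Metal is actually active at load time, the startup log should include
# "ggml_metal_init" lines; the log above shows only the CPU/Accelerate path.
python3 koboldcpp.py /Volumes/SSD/models/dolphin_2_6_mixtral_8x7b_q5_k_m.gguf 6969 --gpulayers 128 --contextsize 32768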

Startup times depend on having the model weights transferred into RAM or VRAM. My guess is that during earlier versions the weights were already cached in memory and thus able to be loaded very quickly. Did you measure the speed the second time you open KoboldCpp? It should load substantially faster on second open. Perhaps the previous time you tried, you just downloaded the model.
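One way to check that theory (a sketch; vmtouch is a third-party tool installable via Homebrew, not part of KoboldCpp):

# See how much of the model file is already resident in the page cache.
brew install vmtouch
vmtouch -v /Volumes/SSD/models/dolphin_2_6_mixtral_8x7b_q5_k_m.gguf
# A high "Resident Pages" percentage means a warm (fast) load is expected;
# a low percentage means the weights will be read from disk again.

# To force a cold load for comparison, flush the filesystem cache first:
sudo purge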

cspenn commented 11 months ago

I did build with Metal, yes. I'll repull just to be sure. I have a script I use to swap models in and out, so the weights don't stay in RAM very long.

cspenn commented 11 months ago

Huh. What changed in 1.54? It's back to its super fast load times: Nete 13B, which was not in memory, loaded in seconds. Mixtral, which used to take close to 5 minutes to load under 1.53, was ready to go in 25 seconds.

LostRuins commented 11 months ago

That's good, but I don't think anything has changed. Like I mentioned, I believe your issue with 1.53 was due to some other bottleneck in the way the weights are stored and loaded on your system. I think if you test 1.53 again after loading the same model in 1.54, it will also load just as quickly.
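A rough back-to-back test to rule caching in or out (a sketch; the version directories and $MODEL are placeholders for however you keep the two builds around):

# Cold start under 1.54: flush the cache, load, and note how long it takes
# to reach the "Load Model OK: True" line.
sudo purge
python3 koboldcpp-1.54/koboldcpp.py "$MODEL" 6969 --gpulayers 128 --contextsize 32768

# Then stop it and, without purging, load the same model under 1.53.
# If the caching explanation holds, 1.53 should now also reach
# "Load Model OK" in roughly the same (fast) time.
python3 koboldcpp-1.53/koboldcpp.py "$MODEL" 6969 --gpulayers 128 --contextsize 32768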