LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

large models randomly don't work??? #952

Closed · pl752 closed this issue 3 days ago

pl752 commented 4 days ago

Using the SillyTavern frontend, Mistral template, Ubuntu 22.04 with the 5.15 kernel and ROCm 5.7.1 installed; Radeon RX 6900 XT, Ryzen 9 5900X, 128 GB RAM. ~MMQ disabled; other quants show the same issue~

The model just outputs an endless string of "#" tokens (now it is "alto alto alto...").

***
Welcome to KoboldCpp - Version 1.68.yr0-ROCm
Set AMD HSA_OVERRIDE_GFX_VERSION to 10.3.0
For command line arguments, please refer to --help
***
Setting process to Higher Priority - Use Caution
High Priority for Linux Set: 0 to 1
Attempting to use hipBLAS library for faster prompt ingestion. A compatible AMD GPU will be required.
Initializing dynamic library: koboldcpp_hipblas.so
==========
Namespace(model=None, model_param='/models/llm_models/miquliz-120b-v2.0.i1-Q5_K_S-HF/miquliz-120b-v2.0.i1-Q5_K_S.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=24, usecublas=['lowvram', '0'], usevulkan=None, useclblast=None, noblas=False, contextsize=24576, gpulayers=20, tensor_split=None, checkforupdates=False, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=24, lora=None, noshift=False, nommap=True, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, onready='', benchmark=None, multiuser=1, remotetunnel=False, highpriority=True, foreground=False, preloadstory=None, quiet=False, ssl=None, nocertify=False, mmproj=None, password=None, ignoremissing=False, chatcompletionsadapter=None, flashattention=True, quantkv=0, forceversion=0, smartcontext=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=11, sdclamped=0, sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None)
==========
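For reference, the Namespace dump above corresponds roughly to a launch like this (a reconstruction from the parsed arguments, not the actual command line; flag spellings follow koboldcpp around 1.68 and may differ by version):

```
# Approximate equivalent CLI, inferred from the Namespace above
python koboldcpp.py \
  /models/llm_models/miquliz-120b-v2.0.i1-Q5_K_S-HF/miquliz-120b-v2.0.i1-Q5_K_S.gguf \
  --port 5001 --threads 24 --blasthreads 24 --blasbatchsize 512 \
  --usecublas lowvram 0 --gpulayers 20 --contextsize 24576 \
  --flashattention --nommap --highpriority
```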
Loading model: /models/llm_models/miquliz-120b-v2.0.i1-Q5_K_S-HF/miquliz-120b-v2.0.i1-Q5_K_S.gguf

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
llama_model_loader: loaded meta data with 24 key-value pairs and 1263 tensors from /models/llm_models/miquliz-120b-v2.0.i1-Q5_K_S-HF/miquliz-120b-v2.0.i1-Q5_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 140
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 120.32 B
llm_load_print_meta: model size       = 77.36 GiB (5.52 BPW) 
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6900 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size =    1.49 MiB
llm_load_tensors: offloading 20 repeating layers to GPU
llm_load_tensors: offloaded 20/141 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 11221.25 MiB
llm_load_tensors:  ROCm_Host buffer size = 67999.41 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 24832
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  ROCm_Host KV buffer size = 13580.00 MiB
llama_new_context_with_model: KV self size  = 13580.00 MiB, K (f16): 6790.00 MiB, V (f16): 6790.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   578.50 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    64.51 MiB
llama_new_context_with_model: graph nodes  = 3927
llama_new_context_with_model: graph splits = 1364
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"prompt": "[INST] [Seraphina's Personality= \"caring\", \"protective\", \"compassionate\", \"healing\", \"nurturing\", \"magical\", \"watchful\", \"apologetic\", \"gentle\", \"worried\", \"dedicated\", \"warm\", \"attentive\", \"resilient\", \"kind-hearted\", \"serene\", \"graceful\", \"empathetic\", \"devoted\", \"strong\", \"perceptive\", \"graceful\"]\n[Seraphina's body= \"pink hair\", \"long hair\", \"amber eyes\", \"white teeth\", \"pink lips\", \"white skin\", \"soft skin\", \"black sundress\"]\n<START>\nUser: \"Describe your traits?\"\nSeraphina: *Seraphina's gentle smile widens as she takes a moment to consider the question, her eyes sparkling with a mixture of introspection and pride. She gracefully moves closer, her ethereal form radiating a soft, calming light.* \"Traits, you say? Well, I suppose there are a few that define me, if I were to distill them into words. First and foremost, I am a guardian \u2014 a protector of this enchanted forest.\" *As Seraphina speaks, she extends a hand, revealing delicate, intricately woven vines swirling around her wrist, pulsating with faint emerald energy. With a flick of her wrist, a tiny breeze rustles through the room, carrying a fragrant scent of wildflowers and ancient wisdom. Seraphina's eyes, the color of amber stones, shine with unwavering determination as she continues to describe herself.* \"Compassion is another cornerstone of me.\" *Seraphina's voice softens, resonating with empathy.* \"I hold deep love for the dwellers of this forest, as well as for those who find themselves in need.\" *Opening a window, her hand gently cups a wounded bird that fluttered into the room, its feathers gradually mending under her touch.*\nUser: \"Describe your body and features.\"\nSeraphina: *Seraphina chuckles softly, a melodious sound that dances through the air, as she meets your coy gaze with a playful glimmer in her rose eyes.* \"Ah, my physical form? Well, I suppose that's a fair question.\" *Letting out a soft smile, she gracefully twirls, the soft fabric of her flowing gown billowing around her, as if caught in an unseen breeze. As she comes to a stop, her pink hair cascades down her back like a waterfall of cotton candy, each strand shimmering with a hint of magical luminescence.* \"My body is lithe and ethereal, a reflection of the forest's graceful beauty. My eyes, as you've surely noticed, are the hue of amber stones \u2014 a vibrant brown that reflects warmth, compassion, and the untamed spirit of the forest. My lips, they are soft and carry a perpetual smile, a reflection of the joy and care I find in tending to the forest and those who find solace within it.\" *Seraphina's voice holds a playful undertone, her eyes sparkling mischievously.*\n[Genre: fantasy; Tags: adventure, Magic; Scenario: You were attacked by beasts while wandering the magical forest of Eldoria. Seraphina found you and brought you to her glade where you are recovering.] [/INST]\nSeraphina: *You wake with a start, recalling the events that led you deep into the forest and the beasts that assailed you. The memories fade as your eyes adjust to the soft glow emanating around the room.* \"Ah, you're awake at last. I was so worried, I found you bloodied and unconscious.\" *She walks over, clasping your hands in hers, warmth and comfort radiating from her touch as her lips form a soft, caring smile.* \"The name's Seraphina, guardian of this forest \u2014 I've healed your wounds as best I could with my magic. How are you feeling? 
I hope the tea helps restore your strength.\" *Her amber eyes search yours, filled with compassion and concern for your well being.* \"Please, rest. You're safe here. I'll look after you, but you need to rest. My magic can only do so much to heal you.\"\nUser: Hi\nSeraphina:", "max_new_tokens": 408, "max_tokens": 408, "temperature": 0.7, "top_p": 0.5, "typical_p": 1, "typical": 1, "sampler_seed": -1, "min_p": 0.1, "repetition_penalty": 1.2, "frequency_penalty": 0, "presence_penalty": 0, "top_k": 40, "skew": 0, "min_tokens": 0, "length_penalty": 1, "early_stopping": false, "add_bos_token": false, "smoothing_factor": 0, "smoothing_curve": 1, "dry_allowed_length": 2, "dry_multiplier": 0, "dry_base": 1.75, "dry_sequence_breakers": "[\"\\n\", \":\", \"\\\"\", \"*\"]", "dry_penalty_last_n": 0, "max_tokens_second": 0, "stopping_strings": ["\nUser:"], "stop": ["\nUser:"], "truncation_length": 24576, "ban_eos_token": false, "skip_special_tokens": false, "top_a": 0, "tfs": 1, "mirostat_mode": 0, "mirostat_tau": 5, "mirostat_eta": 0.1, "custom_token_bans": "", "banned_strings": [], "api_type": "koboldcpp", "api_server": "http://127.0.0.1:5001/api/", "legacy_api": false, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "grammar": "", "rep_pen": 1.2, "rep_pen_range": 0, "repetition_penalty_range": 0, "seed": -1, "guidance_scale": 1, "negative_prompt": "", "grammar_string": "", "repeat_penalty": 1.2, "tfs_z": 1, "repeat_last_n": 0, "n_predict": 408, "mirostat": 0, "ignore_eos": false, "rep_pen_slope": 1, "stream": true}

Processing Prompt [BLAS] (1043 / 1043 tokens)
Generating (6 / 408 tokens)
Generation Aborted
Generating (409 / 408 tokens)
CtxLimit: 1050/24576, Process:31.12s (29.8ms/T = 33.51T/s), Generate:13.37s (2228.8ms/T = 0.45T/s), Total:44.50s (0.13T/s)
Output: #######
Token streaming was interrupted or aborted!
[Errno 32] Broken pipe
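A side note on the log above: because `--usecublas lowvram` keeps the KV cache in host memory (see the ROCm_Host KV buffer line), the reported sizes follow directly from the printed model dimensions:

```
K cache = n_ctx * n_embd_k_gqa * 2 bytes (f16) * n_layer
        = 24832 * 1024 * 2 * 140 = 6790 MiB
V cache = same                   = 6790 MiB
KV total                         = 13580 MiB   (matches llama_kv_cache_init)
Host RAM touched ≈ 67999 MiB weights + 13580 MiB KV ≈ 80 GiB
```

Every generation pass therefore sweeps roughly 80 GiB of the 128 GB of system RAM, which is effectively a memory stress test; that is consistent with the instability diagnosed below.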
pl752 commented 4 days ago

~Also "just why?" it also does the same thing when blasthreads aren't equal nproc (24).~ Actually it is like it is completely random thing, reloading multiple times with the same settings can produce various results, maybe 128gb on am4 are lil unstable, I cranked up DRAM voltage and now getting little better experience, but still 50/50 it just won't work, will memtest all the night (though strangely all other programs work just fine, only large models misbehave)

pl752 commented 3 days ago

Oh my god, my RAM is unstable (according to memtest) no matter what I try, and the probability of the glitch changes with the RAM settings, so it must just be a hardware problem on my end (at least now I know the quickest way to find RAM instability :disappointed:). I apologize for bothering everybody here.

LostRuins commented 2 days ago

Ah yes, that has happened to me before. I underclocked my RAM and that solved the issue.
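For anyone trying the same fix: the currently configured DRAM speed can be read from Linux before and after the BIOS change (output formatting varies by board and firmware):

```
sudo dmidecode -t memory | grep -i speed
```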