LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

The seed is not randomized? #164

Closed Vladonai closed 1 year ago

Vladonai commented 1 year ago

I have noticed that models give the same answers with the same prompt. It seems as if the seed is not randomized.

LostRuins commented 1 year ago

The seed is randomized based on the system clock. Maybe your temperature is too low, or your top K is 1. You can confirm this by regenerating the response and you should get different results.
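
(To see why a very low temperature or a top-K of 1 hides the seed entirely, here is a toy sampler, not koboldcpp's actual code: with top_k=1 only one candidate survives, so reseeding changes nothing, and a near-zero temperature has almost the same effect.)

import math
import random

def sample(logits, temperature=0.7, top_k=0, seed=None):
    """Toy next-token sampler: temperature scaling plus optional top-k truncation."""
    rng = random.Random(seed)                      # the seed would normally come from the clock
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = order[:top_k] if top_k > 0 else order   # top_k == 1 leaves a single candidate
    scaled = [logits[i] / max(temperature, 1e-6) for i in keep]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]    # softmax over the surviving candidates
    return rng.choices(keep, weights=weights)[0]

logits = [2.0, 1.5, 0.2]
print({sample(logits, top_k=1, seed=s) for s in range(10)})            # {0}: the seed is irrelevant
print({sample(logits, temperature=0.05, seed=s) for s in range(10)})   # almost always {0} as well
print({sample(logits, temperature=1.2, seed=s) for s in range(10)})    # now the seed matters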

josefcub commented 1 year ago

I've noticed this too. My top_k is 0, according to the Settings UI.

I figured it was just me, that I'd misadjusted something that brought the creativity way down, but at first glance it does look like the same seed was used both times. I kind of miss the wild swings in tone and prose when I'd click 'Retry'.

Edit: Realizing that my memories and settings could be getting in the way, I clicked New Game, then set the preset purely to 'Godlike' in settings, and ran the experiment again. Here are the results:

Describe a simple wooden box, please.

Sure! A simple wooden box can be made of various types of wood such as oak, pine or cherry. It usually has four straight sides and a flat bottom, and its dimensions are usually small enough to fit easily on a shelf or in a closet. The top of the box is typically hinged and can be lifted up to access whatever items are stored inside.

Clicking 'Retry' nets me:

Sure! A simple wooden box can be made of various types of wood such as oak, pine or cherry. It usually has four straight sides and a flat bottom, and its dimensions are usually small enough to fit easily on a shelf or in a closet. The top of the box is typically hinged and can be lifted up to access whatever items are stored inside.

Which is close to word-for-word identical to the original attempt. Here are the settings after 'New Game' and selecting the Godlike preset:

(screenshot: settings panel after 'New Game' with the Godlike preset)

My command line looks like this:

$ python3 ./koboldcpp.py --threads 12 --stream --host=0.0.0.0 --port 6001 --highpriority --smartcontext ./models.new/Wizard-Vicuna-13B-Uncensored.new.ggml.q5_1.bin

The version I'm running is pulled as of early this morning, though this has been going on for me for a few versions now.

josefcub commented 1 year ago

Looking a little more comprehensively, I once again hit 'New Game', and this time set the preset to [Default].

Describe a simple wooden box, please.

Sure! A simple wooden box can be made of various types of wood such as oak, pine or cedar. It usually has four sides, a top and a bottom, and is hinged at one end to open and close.

and on Retry:

Sure! A simple wooden box can be made of various types of wood such as oak, pine or cedar. It usually has four sides, a top and a bottom, and is typically small enough to fit in your hand or on a tabletop.

and in the log output:

Input: {"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.08, "temperature": 0.62, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n\n\n### Instruction:\n\nDescribe a simple wooden box, please.\n\n### Response:\n\n", "quiet": true, "stop_sequence": ["\n### Instruction:", "\n### Response:"]}

and

Input: {"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.08, "temperature": 0.62, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n\n\n### Instruction:\n\nDescribe a simple wooden box, please.\n\n### Response:\n\n", "quiet": true, "stop_sequence": ["\n### Instruction:", "\n### Response:"]}

Processing Prompt (1 / 1 tokens)

Now in this case, using default settings, we get a significant divergence at the end of the generation. Am I missing something obvious, or is there something about presets (or just Godlike) that is causing near-identical output when you regenerate?
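
(If you'd rather reproduce this check outside the UI, here is a minimal sketch that posts the same request twice to the KoboldAI-style /api/v1/generate endpoint koboldcpp exposes and compares the two generations. The host, port, and sampler values are assumptions taken from the logs above; adjust them to your own --host/--port flags.)

import json
import urllib.request

API = "http://localhost:5001/api/v1/generate"   # change to match your launch flags

payload = {
    "prompt": "### Instruction:\nDescribe a simple wooden box, please.\n\n### Response:\n",
    "max_length": 80,
    "temperature": 0.62,
    "top_p": 0.9,
    "top_k": 0,
    "rep_pen": 1.08,
}

def generate(p):
    # POST the payload and pull the text out of the usual KoboldAI response shape.
    req = urllib.request.Request(API, data=json.dumps(p).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"][0]["text"]

first, second = generate(payload), generate(payload)
print("identical" if first == second else "different")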

Vladonai commented 1 year ago

I moved the temperature slider and changed the value of top_k, and the problem went away (not immediately). The answers to the same prompt are now different. But the problem was definitely there. Apparently there is some bug.

LostRuins commented 1 year ago

I think some models are also inherently less random than others. With the base llama model and default settings, I have no issues with the output changing when I retry. I've heard that Wizard-based models are less random. Please try with a different model.

noprotocolunit commented 1 year ago

A different llama.cpp implementation may be seeing a similar issue: https://github.com/abetlen/llama-cpp-python/discussions/210

WolframRavenwolf commented 1 year ago

I'm seeing the same problem with koboldcpp-1.22-CUDA-ONLY and TheBloke_WizardLM-7B-uncensored-GGML/WizardLM-7B-uncensored.ggml.q5_1 and TheBloke_wizard-mega-13B-GGML.q5_1. No idea if it's a koboldcpp problem, the Wizard models, or the quantization, but it definitely isn't randomized enough. That's why I looked around to see if it's a bug or something and found this issue, so I thought I'd add my observations.

LostRuins commented 1 year ago

One way to force more randomness is to use a very high temperature and increase top P

WolframRavenwolf commented 1 year ago

One way to force more randomness is to use a very high temperature and increase top P

Maybe that's the reason (or a part thereof)? I've recently started using --usemirostat 2 5.0 0.1, perhaps that's manipulating generation parameters in a way that reduces/eliminates randomness.

Combined with models like Wizard, which according to gist74's link have inherently less randomness, that could explain why regenerations produce the same output.

So many variables have changed, though, with new models, quantizations, and hardware acceleration all happening at once.

JHawkley commented 1 year ago

I've recently started using --usemirostat 2 5.0 0.1, perhaps that's manipulating generation parameters in a way that reduces/eliminates randomness.

You are correct. Although temperature is supposedly used as a seed value, mirostat very quickly converges on an optimal temperature for the current context, so adjusting temperature does little to affect the output.

From my tests, I found that mirostat doesn't eliminate randomness, but the output does become much less random. Some models I tested would output the same thing with mirostat every time, while others would generate variations of the same overall idea and were more likely to diverge the further along they got.
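
(For readers unfamiliar with why mirostat converges, here is a rough Python sketch of the mirostat v2 idea from the published algorithm: a target surprise tau, a learning rate eta, and a running threshold mu. It is an illustration, not koboldcpp's actual implementation, and the toy distribution at the bottom is made up.)

import math
import random

def mirostat_v2_step(probs, state, tau=5.0, eta=0.1, rng=random):
    """One mirostat v2 step over a list of (token, prob) pairs.

    state holds mu, the running surprise threshold (initialised to 2 * tau).
    Tokens whose surprise -log2(p) exceeds mu are cut, the survivors are
    renormalised and sampled from, and mu is nudged toward the target tau.
    """
    mu = state.setdefault("mu", 2.0 * tau)
    kept = [(tok, p) for tok, p in probs if -math.log2(p) < mu]
    if not kept:                                   # always keep at least the top token
        kept = [max(probs, key=lambda x: x[1])]
    total = sum(p for _, p in kept)
    tokens, weights = zip(*kept)
    tok = rng.choices(tokens, weights=[p / total for p in weights])[0]
    observed = -math.log2(dict(probs)[tok])        # surprise of the chosen token
    state["mu"] = mu - eta * (observed - tau)      # feedback toward the target surprise
    return tok

# Toy run: with this very confident distribution the chosen tokens are far less
# surprising than tau, so mu keeps growing and never cuts anything, and the
# dominant token keeps winning, consistent with the near-identical retries above.
state = {}
dist = [("the", 0.85), ("a", 0.10), ("his", 0.04), ("zany", 0.01)]
for _ in range(5):
    print(mirostat_v2_step(dist, state), round(state["mu"], 2))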

My gut feeling is that how little randomness you observe might correlate with overfitting in the model's finetune/LoRA. Might be a nice smoketest for that.

Anyways, the trick I found to force starkly different generations with models that fall into a "mirostat rut" was to tweak the repetition penalty parameters, particularly the penalty and the slope values. I'll often swing between 1.05/7 and 1.25/11 when I'm not happy with the output and retries are not different enough. The repetition penalty sampling is hardcoded to happen before mirostat sampling, so it has a huge effect on it.
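
(As a concrete illustration of that trick, a small sketch that alternates between a mild and a strong repetition-penalty profile on each retry. The field names come from the request logs earlier in this thread; the exact values are just the ones mentioned above.)

import itertools

# Alternate between a "mild" and a "strong" repetition-penalty profile on each
# retry to jolt a model out of a mirostat rut.
profiles = itertools.cycle([
    {"rep_pen": 1.05, "rep_pen_slope": 7,  "rep_pen_range": 1024},
    {"rep_pen": 1.25, "rep_pen_slope": 11, "rep_pen_range": 1024},
])

base = {"prompt": "Describe a simple wooden box, please.", "max_length": 80}
for attempt in range(2):
    payload = {**base, **next(profiles)}
    print(payload)   # send this to /api/v1/generate instead of printing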

WolframRavenwolf commented 1 year ago

Thank you for the insight! I agree with your observation about overfitting, too, which correlates with the aforementioned issue with e.g. WizardLM (Possible sign of overtraining).

Another thing I noticed is that small differences in any part of the input have a huge influence on the output. Even a simple change like swapping {{char}} and {{user}} to {{user}} and {{char}} in Write {{char}}'s next reply in a fictional chat between {{char}} and {{user}} can turn a coherent multi-paragraph response into a garbage one where it just repeats a word over and over.

Which has been frustrating, because I've spent a lot of time evaluating models, like others have, and I now wonder if that's futile when such small changes can swing the results completely. I'll investigate whether mirostat has an impact on that, too.

JHawkley commented 1 year ago

I've personally found mirostat to be surprisingly resistant to small changes affecting the output in a very striking way. I've tried things like replacing words with synonyms and it still outputs roughly the same follow-up sentence.

Perhaps it depends on where you make the change, though. I wasn't making it in the instruction portion of the context like that.

Mirostat is very weird. I like it and I think it is a better sampler in general, but I would really like to have it respond to the seed more, or otherwise have a better means to vary the output. Maybe it could be as simple as having a scalar parameter that controls how much it munges the sampled token weights (applies a random offset) in a seeded manner, or something. I dunno.

WolframRavenwolf commented 1 year ago

I've since switched to 33B models because they seem both more intelligent than 13B (and of course 7B) and also more resistant to randomness. With 13B a change in the instructions as shown above would produce completely different outputs even with Mirostat, sometimes ignoring the instructions completely, but the 33B models I tested kept adhering to the instructions.

It's been such a night and day difference that I've completely switched to 33B (with Mirostat) even on my puny laptop with just 8 GB VRAM. Still produces (barely) acceptable performance with GPU layers and streaming while offering more intelligence and less randomness than smaller models.

With the smaller number of 33B models and the longer benchmark times, this is still fairly anecdotal, though, so others' findings are appreciated to see whether it's truly a general trend with model size.

LostRuins commented 1 year ago

Closing this issue for now. Note that there are a few enhancements future users can enjoy in the latest version.

Using --debugmode allows you to see token probabilities, revealing just how random the model is based on your settings.

Additionally, the API now supports setting the seed via sampler_seed, so deterministic tests are possible.
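
(A brief sketch of such a deterministic test, using the sampler_seed field mentioned above. It follows the same request pattern as the earlier sketch; the endpoint, port, and the other payload fields are assumptions based on the logs in this thread.)

import json
import urllib.request

def generate(payload, url="http://localhost:5001/api/v1/generate"):
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"][0]["text"]

payload = {"prompt": "Describe a simple wooden box, please.",
           "max_length": 60, "temperature": 0.7, "top_p": 0.9,
           "sampler_seed": 1234}                 # pin the seed for reproducible sampling

print(generate(payload) == generate(payload))    # expected: True with a fixed seed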

aleksusklim commented 1 year ago

For me, mirostat (at 2/5.0/0.1 and even 2/10.0/0.1) always chooses the same tokens that are chosen with temp=0.1 + top_k=1 without mirostat. It looks like it drops the temperature to zero, and that's all it does.

With --debugmode I see low probabilities (around 7% for single-token words), and yet the same tokens are chosen as would be picked with 100% probability at top_k=1. Why? Does this really work as expected? (So the problem is not only "mirostat always gives the same output" but "mirostat de facto produces zero temperature", which is worse, because why use it in the first place, then?)

I tried different models and settings. Even temp=2 cannot fix mirostat: it lowers the percentages in the debug output, but the same tokens are used regardless. @LostRuins, are you sure this is not a bug? Maybe something is locking the actual probabilities? My sampler order is 6,0,1,3,4,2,5.

aleksusklim commented 1 year ago

Related: https://github.com/LostRuins/koboldcpp/pull/338. I'll check after the next release.

aleksusklim commented 1 year ago

@LostRuins, I've enabled mirostat2 together with extended context length and all my usual settings.

Indeed, the generation is different, and not the same every time! But…

How can I make sure that mirostat is actually used? I mean, if I lower the temperature to 0.1, it starts to output roughly the same text. If I raise it to 2, it becomes chaotic. If I set top_p = 0, I again get exactly the same text, just as with top_k = 1.

This is just how it behaves without mirostat. Isn't mirostat supposed to ignore the top_p and top_k sampler parameters?

LostRuins commented 1 year ago

I don't know how you would accurately differentiate mirostat, but you can try comparing token probabilities with --debugmode on.

aleksusklim commented 1 year ago

Thank you, I ran some tests. For example, with pygmalion-13b-ggml-q5_1.bin and this prompt:

You are an interface to Wikipedia. You will process user input to reply with a short description of the subject that he asked.
Please return only correct and concise texts!
User: Who is Albert Einstein?
Wikipedia:

Context = 4096, temperature = 0.85, top_p = 1.0, top_k = 3, penalty = 1.1

mirostat | normal (the two result columns are flattened side by side on each line below, each listing the top-3 token probabilities)
Albert 64.40% | " 25.35% | Ein 10.25% Albert 64.40% | " 25.35% | Ein 10.25%
Ein 99.17% | Abraham 0.47% | was 0.37% Ein 99.17% | Abraham 0.47% | was 0.37%
stein 99.98% | ste 0.02% | sein 0.00% stein 99.98% | ste 0.02% | sein 0.00%
was 78.10% | ( 20.79% | , 1.12% was 78.10% | ( 20.79% | , 1.12%
a 80.91% | born 16.71% | an 2.39% a 80.91% | born 16.71% | an 2.39%
German 81.23% | theoretical 13.47% | phys 5.30% German 81.23% | theoretical 13.47% | phys 5.30%
- 81.41% | phys 11.26% | theoretical 7.32% - 81.41% | phys 11.26% | theoretical 7.32%
born 99.23% | Sw 0.54% | American 0.23% born 99.23% | Sw 0.54% | American 0.23%
theoretical 75.90% | phys 22.45% | scient 1.65% theoretical 75.90% | phys 22.45% | scient 1.65%
phys 99.99% | Phys 0.01% | physics 0.01% phys 99.99% | Phys 0.01% | physics 0.01%
ic 100.00% | ician 0.00% | c 0.00% ic 100.00% | ician 0.00% | c 0.00%
ist 99.97% | ists 0.03% | is 0.00% ist 99.97% | ists 0.03% | is 0.00%
who 96.23% | , 2.30% | . 1.47% who 96.23% | , 2.30% | . 1.47%
developed 99.76% | is 0.19% | created 0.05% developed 99.76% | is 0.19% | created 0.05%
the 99.82% | one 0.09% | his 0.08% the 99.82% | one 0.09% | his 0.08%
general 21.20% | special 45.65% | theory 33.14% general 21.20% | special 45.65% | theory 33.14%
theory 99.99% | Theory 0.01% | relativ 0.00% theory 99.99% | Theory 0.01% | relativ 0.00%
of 99.99% | for 0.01% | relativ 0.01% of 99.99% | for 0.01% | relativ 0.01%
relativ 99.99% | Rel 0.01% | relative 0.00% relativ 99.99% | Rel 0.01% | relative 0.00%
ity 100.00% | ty 0.00% | i 0.00% ity 100.00% | ty 0.00% | i 0.00%
, 85.51% | . 13.67% | and 0.82% , 85.51% | . 13.67% | and 0.82%
one 99.46% | among 0.42% | which 0.12% one 99.46% | among 0.42% | which 0.12%
of 100.00% | 0.00% | [ 0.00% of 100.00% | 0.00% | [ 0.00%
the 90.18% | two 9.69% | history 0.14% the 90.18% | two 9.69% | history 0.14%
two 99.94% | most 0.04% | Two 0.01% two 99.94% | most 0.04% | Two 0.01%
pill 99.24% | p 0.75% | most 0.01% pill 99.24% | p 0.75% | most 0.01%
ars 100.00% | ers 0.00% | ar 0.00% ars 100.00% | ers 0.00% | ar 0.00%
of 99.99% | in 0.01% | ( 0.00% of 99.99% | in 0.01% | ( 0.00%
modern 100.00% | contemporary 0.00% | Modern 0.00% modern 100.00% | contemporary 0.00% | Modern 0.00%
physics 99.99% | science 0.01% | physics 0.00% physics 99.99% | science 0.01% | physics 0.00%
[DIFFERENCE FROM HERE] ( 21.82% | . 46.68% | alongside 31.50%
alongside 31.50% | . 46.68% | ( 21.82% al 99.79% | the 0.13% | t 0.08%
quantum 98.77% | Isaac 1.17% | Newton 0.06% ong 99.94% | ongs 0.06% | ONG 0.00%
mechan 100.00% | Mechan 0.00% | mechanical 0.00% side 100.00% | with 0.00% | side 0.00%
ics 100.00% | ic 0.00% | isms 0.00% quantum 99.94% | quant 0.04% | special 0.02%
. 97.79% | ( 1.14% | , 1.07% mechan 100.00% | Mechan 0.00% | mechanical 0.00%
Ein 30.40% | 40.64% | His 28.96% ics 100.00% | ic 0.00% | icks 0.00%
ste 0.36% | stein 99.62% | 0.01% ). 99.74% | ), 0.13% | ) 0.13%
ins 99.80% | i 0.10% | int 0.10% His 38.09% | 47.01% | Ein 14.89%
work 99.97% | paper 0.02% | achiev 0.01% work 100.00% | achiev 0.00% | works 0.00%
is 97.84% | in 1.61% | also 0.55% is 99.30% | changed 0.35% | also 0.35%
also 100.00% | also 0.00% | generally 0.00% also 100.00% | also 0.00% | Also 0.00%
known 99.98% | recognized 0.01% | widely 0.01% known 99.97% | recognized 0.03% | known 0.00%
for 100.00% | in 0.00% | from 0.00% for 100.00% | in 0.00% | as 0.00%
its 99.81% | his 0.13% | playing 0.06% its 99.27% | lay 0.41% | playing 0.32%
influence 99.78% | influ 0.20% | influen 0.02% influence 99.80% | influ 0.19% | role 0.01%
on 99.55% | outside 0.30% | across 0.15% on 99.96% | across 0.02% | outside 0.02%
the 96.43% | philosophy 3.46% | science 0.11% the 96.54% | philosophy 3.39% | science 0.07%
philosophy 99.03% | philosoph 0.96% | history 0.01% philosophy 99.26% | philosoph 0.68% | history 0.06%
of 99.96% | and 0.02% | . 0.01% of 99.98% | and 0.01% | . 0.01%
science 99.96% | physics 0.03% | religion 0.01% science 99.96% | physics 0.03% | space 0.01%
. 92.61% | 3.73% | , 3.66% . 95.72% | 3.26% | , 1.01%

(The change is noted with the "[DIFFERENCE FROM HERE]" marker in one cell above.) Here are the results with top_p = 0.9 and top_k = 0:

Mirostat:

Generating (1 / 94 tokens) [( Albert 54.20%) ( " 21.33%) ( Ein 8.63%) ( He 7.25%)]
Generating (2 / 94 tokens) [( Ein 100.00%)]
Generating (3 / 94 tokens) [(stein 100.00%)]
Generating (4 / 94 tokens) [( was 78.98%) ( ( 21.02%)]
Generating (5 / 94 tokens) [( a 82.89%) ( born 17.11%)]
Generating (6 / 94 tokens) [( German 81.23%) ( theoretical 13.47%) ( phys 5.30%)]
Generating (7 / 94 tokens) [(- 81.41%) ( phys 11.26%) ( theoretical 7.32%)]
Generating (8 / 94 tokens) [(born 100.00%)]
Generating (9 / 94 tokens) [( phys 22.82%) ( theoretical 77.18%)]
Generating (10 / 94 tokens) [(ic 100.00%)]
Generating (11 / 94 tokens) [(ist 100.00%)]
Generating (12 / 94 tokens) [( who 100.00%)]
Generating (13 / 94 tokens) [( developed 100.00%)]
Generating (14 / 94 tokens) [( the 100.00%)]
Generating (15 / 94 tokens) [( special 96.70%) ( general 3.30%)]
Generating (16 / 94 tokens) [( and 89.54%) ( theory 10.46%)]
Generating (17 / 94 tokens) [( general 100.00%)]
Generating (18 / 94 tokens) [( theories 100.00%)]
Generating (19 / 94 tokens) [( of 100.00%)]
Generating (20 / 94 tokens) [( relativ 100.00%)]
Generating (21 / 94 tokens) [(ity 100.00%)]
Generating (22 / 94 tokens) [(. 45.64%) (, 54.36%)]

Normal:

Generating (1 / 94 tokens) [( Albert 54.20%) ( " 21.33%) ( Ein 8.63%) ( He 7.25%)]
Generating (2 / 94 tokens) [( Ein 100.00%)]
Generating (3 / 94 tokens) [(stein 100.00%)]
Generating (4 / 94 tokens) [( was 78.98%) ( ( 21.02%)]
Generating (5 / 94 tokens) [( a 82.89%) ( born 17.11%)]
Generating (6 / 94 tokens) [( German 81.23%) ( theoretical 13.47%) ( phys 5.30%)]
Generating (7 / 94 tokens) [(- 81.41%) ( phys 11.26%) ( theoretical 7.32%)]
Generating (8 / 94 tokens) [(born 100.00%)]
Generating (9 / 94 tokens) [( phys 22.82%) ( theoretical 77.18%)]
Generating (10 / 94 tokens) [(ic 100.00%)]
Generating (11 / 94 tokens) [(ist 100.00%)]
Generating (12 / 94 tokens) [( who 100.00%)]
Generating (13 / 94 tokens) [( developed 100.00%)]
Generating (14 / 94 tokens) [( the 100.00%)]
Generating (15 / 94 tokens) [( general 3.30%) ( special 96.70%)]
Generating (16 / 94 tokens) [( theory 100.00%)]
Generating (17 / 94 tokens) [( of 100.00%)]
Generating (18 / 94 tokens) [( relativ 100.00%)]
Generating (19 / 94 tokens) [(ity 100.00%)]
Generating (20 / 94 tokens) [(, 66.05%) (. 33.95%)]

(The difference is whether "general" or "special" was sampled.)

@LostRuins, are you sure that mirostat changes token probabilities? To me it looks like it is not working at all. My results are cherry-picked so that more similar tokens were chosen, but still, their probabilities are exactly the same.

Is there any other way to somehow confirm that mirostat actually kicked in? Can we print its internal state (if it has one) during generation?

If you want to be sure that I've enabled Mirostat, here are the console printouts. Mirostat:

Namespace(bantokens=None, blasbatchsize=1024, blasthreads=14, contextsize=4096, debugmode=True, forceversion=0, gpulayers=0, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, mirostat=[2, 5.0, 0.1], model=None, model_param='C:/GGML/pygmalion-13b-ggml-q5_1.bin', noavx2=False, noblas=False, nommap=False, port=5001, port_param=5001, psutil_set_threads=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=False, tensor_split=None, threads=14, unbantokens=True, useclblast=None, usecublas=None, usemirostat=None, usemlock=False)

Normal:

Namespace(bantokens=None, blasbatchsize=1024, blasthreads=14, contextsize=4096, debugmode=True, forceversion=0, gpulayers=0, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, mirostat=None, model=None, model_param='C:/GGML/pygmalion-13b-ggml-q5_1.bin', noavx2=False, noblas=False, nommap=False, port=5001, port_param=5001, psutil_set_threads=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=False, tensor_split=None, threads=14, unbantokens=True, useclblast=None, usecublas=None, usemirostat=None, usemlock=False)

LostRuins commented 1 year ago

Uhhh.... From your log, your mirostat is off! See usemirostat=None displayed in your console print. Can you share your command-line launch parameters (or a screenshot of the mirostat pane if you are using the GUI)?

aleksusklim commented 1 year ago

From your log, you "mirostat" is off! See usemirostat=None displayed in your console print.

LOL, really! How?

or screenshot the mirostat pane if you are using the GUI

I was using the GUI, yes. Here are the logs I get if I just tick Use Mirostat on the Tokens tab, leaving everything else at default:

***
Welcome to KoboldCpp - Version 1.37.1
For command line arguments, please refer to --help
***
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
==========
Namespace(bantokens=None, blasbatchsize=512, blasthreads=9, contextsize=2048, debugmode=False, forceversion=0, gpulayers=0, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, mirostat=[2, 5.0, 0.1], model=None, model_param='C:/GGML/pygmalion-13b-ggml-q5_1.bin', noavx2=False, noblas=False, nommap=False, port=5001, port_param=5001, psutil_set_threads=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=False, tensor_split=None, threads=9, unbantokens=False, useclblast=None, usecublas=None, usemirostat=None, usemlock=False)
==========
Loading model: C:\GGML\pygmalion-13b-ggml-q5_1.bin
[Threads: 9, BlasThreads: 9, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\GGML\pygmalion-13b-ggml-q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: mem required  = 9945.07 MB (+ 1600.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

But here is what I get if I use the command line:

koboldcpp.exe --usemirostat 2 5.0 0.1

***
Welcome to KoboldCpp - Version 1.37.1
For command line arguments, please refer to --help
***
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
==========
Namespace(bantokens=None, blasbatchsize=512, blasthreads=9, contextsize=2048, debugmode=0, forceversion=0, gpulayers=0, highpriority=False, hordeconfig=None, host='', launch=False, lora=None, model=None, model_param='C:/GGML/pygmalion-13b-ggml-q5_1.bin', noavx2=False, noblas=False, nommap=False, port=5001, port_param=5001, psutil_set_threads=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=False, tensor_split=None, threads=9, unbantokens=False, useclblast=None, usecublas=None, usemirostat=[2.0, 5.0, 0.1], usemlock=False)
==========
Loading model: C:\GGML\pygmalion-13b-ggml-q5_1.bin
[Threads: 9, BlasThreads: 9, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
Using automatic RoPE scaling (scale:1.000, base:10000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\GGML\pygmalion-13b-ggml-q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: mem required  = 9945.07 MB (+ 1600.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

Indeed, the GUI version writes mirostat=[2, 5.0, 0.1] while the command-line version writes usemirostat=[2.0, 5.0, 0.1]. Is this a naming bug somewhere?

As for testing, the token probabilities are now completely different, and often equal to 100%, but not always. It is clearly ignoring the sampler parameters, for example with top_p = 0.002, top_k = 1, temp = 0.1:

Generating (60 / 94 tokens) [( of 100.00%)]
Generating (61 / 94 tokens) [( the 100.00%)]
Generating (62 / 94 tokens) [(  100.00%)]
Generating (63 / 94 tokens) [(2 100.00%)]
Generating (64 / 94 tokens) [(0 100.00%)]
Generating (65 / 94 tokens) [(th 100.00%)]
Generating (66 / 94 tokens) [( century 100.00%)]
Generating (67 / 94 tokens) [(, 56.06%) (. 42.31%) ( and 1.63%)]
Generating (68 / 94 tokens) [( and 100.00%)]
Generating (69 / 94 tokens) [( considered 99.37%) ( a 0.63%) ( one 0.00%) ( often 0.00%)]
Generating (70 / 94 tokens) [( by 100.00%)]
Generating (71 / 94 tokens) [( many 100.00%)]
Generating (72 / 94 tokens) [( to 100.00%) ( the 0.00%)]
Generating (73 / 94 tokens) [( be 100.00%)]
Generating (74 / 94 tokens) [( the 100.00%)]
Generating (75 / 94 tokens) [( greatest 99.98%) ( most 0.02%)]
Generating (76 / 94 tokens) [( phys 99.18%) ( scient 0.82%)]
Generating (77 / 94 tokens) [(ic 100.00%)]
Generating (78 / 94 tokens) [(ist 100.00%)]
Generating (79 / 94 tokens) [( of 100.00%)]
Generating (80 / 94 tokens) [( all 100.00%)]
Generating (81 / 94 tokens) [( time 100.00%)]
Generating (82 / 94 tokens) [(. 100.00%)]

But with top_p = 1, top_k = 0, temp = 2:

Generating (73 / 94 tokens) [( Nobel 100.00%)]
Generating (74 / 94 tokens) [( Prize 100.00%)]
Generating (75 / 94 tokens) [( in 100.00%)]
Generating (76 / 94 tokens) [( Physics 100.00%)]
Generating (77 / 94 tokens) [( " 100.00%)]
Generating (78 / 94 tokens) [(for 100.00%)]
Generating (79 / 94 tokens) [( his 100.00%)]
Generating (80 / 94 tokens) [( services 100.00%)]
Generating (81 / 94 tokens) [( to 100.00%)]
Generating (82 / 94 tokens) [( theoretical 66.31%) ( The 33.69%)]
Generating (83 / 94 tokens) [( physics 100.00%)]
Generating (84 / 94 tokens) [(" 25.12%) ( through 44.24%) (". 30.64%)]
Generating (85 / 94 tokens) [( while 100.00%)]
Generating (86 / 94 tokens) [( working 100.00%)]
Generating (87 / 94 tokens) [( in 100.00%)]
Generating (88 / 94 tokens) [( Switzerland 100.00%)]
Generating (89 / 94 tokens) [(. 100.00%)]

So it is working? It used different words on subsequent runs, even if those words were not listed previously. For example:

Processing Prompt (1 / 1 tokens)
Generating (1 / 94 tokens) [( Albert 100.00%)]
Generating (2 / 94 tokens) [( Ein 100.00%)]
Generating (3 / 94 tokens) [(stein 100.00%)]
Generating (4 / 94 tokens) [( was 100.00%)]
Generating (5 / 94 tokens) [( a 100.00%)]
Generating (6 / 94 tokens) [( German 100.00%)]
Generating (7 / 94 tokens) [(- 100.00%)]
Generating (8 / 94 tokens) [(born 100.00%)]
Generating (9 / 94 tokens) [( theoretical 100.00%)]

Processing Prompt (1 / 1 tokens)
Generating (1 / 94 tokens) [( Albert 100.00%)]
Generating (2 / 94 tokens) [( Ein 100.00%)]
Generating (3 / 94 tokens) [(stein 100.00%)]
Generating (4 / 94 tokens) [( ( 34.03%) ( was 59.30%) (, 6.67%)]
Generating (5 / 94 tokens) [(1 100.00%)]
Generating (6 / 94 tokens) [(4 100.00%)]
Generating (7 / 94 tokens) [( March 100.00%)]
Generating (8 / 94 tokens) [(  100.00%)]
Generating (9 / 94 tokens) [(1 100.00%)]

(So, for proper mirostat those probabilities are almost meaningless, right?)

Turns out you should fix the GUI's internal key name for usemirostat. By the way, on the Tokens tab the context length is capped at 4096, while the Quick Launch tab allows setting as much as 8192.

LostRuins commented 1 year ago

Yes it is indeed a bug with my GUI key. Will be fixed in the next version, thanks!
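
(For readers curious what this kind of mismatch can look like, here is a purely hypothetical sketch, not koboldcpp's actual code: the CLI path stores the values under usemirostat, while a GUI launcher that builds its namespace by hand writes them under mirostat, so a backend that only checks usemirostat never sees them. This matches what the GUI and CLI logs above showed.)

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--usemirostat", nargs=3, type=float)      # CLI path: stored as args.usemirostat

args = parser.parse_args(["--usemirostat", "2", "5.0", "0.1"])
print(args.usemirostat)                                        # [2.0, 5.0, 0.1] -> backend sees it

# Hypothetical GUI path: the launcher builds the namespace by hand under the
# wrong attribute name, so a check on usemirostat finds nothing.
gui_args = argparse.Namespace(mirostat=[2, 5.0, 0.1], usemirostat=None)
if gui_args.usemirostat is not None:
    print("mirostat enabled")
else:
    print("mirostat silently disabled")                        # what the GUI logs above showed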