LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

New API feature that generates one token but shows its logits / probabilities ("Manual sampling" mode) #975

Open aleksusklim opened 3 months ago

aleksusklim commented 3 months ago

I can name several use cases where a user might want to see not only the randomly generated text, but an exact list of possibilities for the next token. For example:

  1. When a story progresses, you might give several options for the model to choose from. If you set temp=0/top_k=1 you will get the most probable one. Otherwise, you could retry several times and get all of them randomly, but you have no way to actually rank them and judge the model's preference. Examples are: a path to pick, seconds to wait, a character name to address, an item to use, etc.
  2. You ask the model to list many things (for example, "give me a random city name") and want to understand how many of them it really knows about. Even if you ask it to reply with a sequential list, at some point it would still be interesting to see "what else it thinks of" after it has generated the most common or easy ones.
  3. The model refuses to answer your question on-topic, or it roleplays down a different path than the one you want it to take. Then everything comes down to "probing" it, adding some hinting words to the start of its own answer (for example, if you want it to speak in first person, start with "I"; if you want it to produce a quote, add a quotation mark, and so on). But this greatly limits the randomness of the model's actions, because you basically throw away one or several of its "default" (currently most probable) responses and substitute your own made-up one – with no way to uncover what else it could have said instead!
  4. You don't like the model's response and you retry. But it generates the same sentence again! You retry again, but it is stuck there. Then you have to raise the temperature or otherwise alter the sampling settings temporarily just to see any difference between retries, and then remember to change them back.
  5. You want to compare the behavior of several different models (or finetunes) in an objective manner. For example, you give it a math question and stop the generation right before the actual answer. Then you want to know exactly how much the probability of the right answer increases when you change the model. This could involve more prompt-tuning (because the other model might need a different template or wording), but at some point you would still need the raw probabilities to compare!

I understand that even a direct list of token predictions is not enough to judge everything I mentioned above, simply because the model might start its response with unrelated filler words instead of your expected token (for example, you write ", so the answer is" and among the low-probability tokens it might want to say "as simple as", leaving you with another logical point where you would need to account for the probabilities again), but you still need the ability to see the list somehow!

Why not use "Debug mode", which shows the probabilities, you might ask? Here is why:

  1. You have to enable it, which means restarting the server. Often I need to see the probabilities long after I started the story, and I just don't want to restart it for this. Sometimes I leave debug mode enabled if I assume I might need it later.
  2. You have to switch to the console to see it. Not convenient when working over a remote connection in the browser only.
  3. It shows only a few top tokens. Sometimes that is enough, but for several of the tasks mentioned above I want a much larger list! Ideally, it should be configurable at runtime, so I can adjust it whenever I need to.
  4. It does not show very low-probability tokens (less than 0.5%?), even if they should theoretically exist there. I often see "100%" in debug tokens, even with temp>0 and top_k>1, and not in the middle of a word.
  5. Debug probabilities are rounded, but I might want to see raw values for very subtle comparisons, for example when testing the influence of different prompt wordings on model accuracy.

I propose a mode called "Manual sampling". To activate it in Lite, the user clicks on "Chat Select" (to the left of the input box) and in the popup chooses "Manual sampling (reveal token probabilities)" – a new proposed link, below the existing "Impersonate the AI Assistant" and "Make the AI write a response as me (for 1 turn)".

After that, the request is performed as it is now (sending everything and clearing the input box), but the interface changes:

Additional entries in Settings:

Instead of generating just one token, we can proceed to the normal stopping conditions and generate as many tokens as usual. But instead of exiting the yellow generation mode (even if Abort was pressed), we can stay in manual sampling: the server should add to its response an array of token probabilities for each generated token, along with their token indices and the chosen one (so the client knows exactly which it was without tokenizing anything again). This way, Lite can render a pre-populated stack of tables, so pressing "Rewind" will discard the last token and show the probabilities at the previous one (starting the generation from that point again if the user chooses a different path).
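As a rough illustration of the kind of per-token data this would need (all field names below are made up for the sketch, not an existing koboldcpp schema):

```python
# Hypothetical response shape for "manual sampling" mode (illustrative only):
# every generated token carries its id, its text, the post-sampling probability
# it was chosen with, and the full candidate list for that step, so the client
# can rewind and branch without re-tokenizing anything.
example_response = {
    "results": [{
        "text": " Paris",
        "tokens": [
            {
                "id": 12345,            # chosen token id (made-up value)
                "text": " Paris",
                "prob": 0.83,           # probability the chosen token was sampled with
                "candidates": [         # top-N (configurable) alternatives at this step
                    {"id": 12345, "text": " Paris", "prob": 0.83},
                    {"id": 23456, "text": " Lyon",  "prob": 0.09},
                    {"id": 34567, "text": " the",   "prob": 0.03},
                ],
            },
            # ...one entry per generated token, forming the "table stack"...
        ],
    }]
}
```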

Hmm, at this point it becomes more convenient to click on the exact part of the yellow text that I want to change… Maybe render the tokens as separate HTML elements with a hover hint (highlighting the token boundary and showing its main probability) and an onclick handler that will "rewind here" in the stack? (Of course, when the text becomes white, all of this is discarded and the stack is freed.) Or at least maybe the "Delete" shortcut key could "rewind to the beginning", and then you could press "Enter" several times, instantly "typing back" the already cached tokens?

I pay so much attention to the client interface because it is more important than just knowing or printing the probabilities – you need a quick and effective way to explore them! But a convenient API is important too, because I can imagine a custom script that does a full "tree search": generating an answer, then picking the most probable "path" (multiplying the intermediate probabilities) that has not been explored yet, and generating again until all paths summing to the desired probability have been explored – and then adding up the final probabilities of every branch that contains the right answer (detected externally somehow) – to estimate the "overall probability of a correct answer" for a model, regardless of the wording of its different answers (and since this accounts for ALL of the probabilities, not just temp=0/top_k=1, such an estimate will be more accurate than greedy sampling or simple retrying).
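A rough sketch of that tree-search idea, under the assumption that the proposed API exposes per-step candidate probabilities. Here `generate(prefix_tokens)` is a hypothetical callback that completes a forced prefix and returns the candidates seen at every step, and `is_correct(text)` is an external checker – both are placeholders, not real koboldcpp functions:

```python
import heapq

def correct_answer_mass(generate, is_correct, mass_budget=0.95, top_n=5):
    """Best-first exploration of completion branches.

    generate(prefix_tokens) -> (steps, final_text), where steps is a list of
      (chosen_token, chosen_prob, [(candidate_token, candidate_prob), ...]).
    is_correct(final_text) -> bool, decided externally.
    Returns the summed probability of all explored branches judged correct.
    """
    # Frontier of unexplored prefixes, ordered by how probable they are to occur.
    # Entries are (-prefix_probability, prefix_tokens); heapq pops the smallest,
    # so negating the probability yields the most probable prefix first.
    frontier = [(-1.0, [])]
    explored_mass = 0.0
    correct_mass = 0.0

    while frontier and explored_mass < mass_budget:
        neg_p, prefix = heapq.heappop(frontier)
        branch_prob = -neg_p
        tokens = list(prefix)

        steps, final_text = generate(prefix)  # complete this prefix to the end

        for chosen_tok, chosen_prob, candidates in steps:
            # Every non-chosen candidate becomes a new branch to explore later,
            # weighted by the probability of reaching this step and then taking it.
            for cand_tok, cand_prob in candidates[:top_n]:
                if cand_tok != chosen_tok:
                    heapq.heappush(frontier,
                                   (-(branch_prob * cand_prob), tokens + [cand_tok]))
            branch_prob *= chosen_prob
            tokens.append(chosen_tok)

        explored_mass += branch_prob
        if is_correct(final_text):
            correct_mass += branch_prob

    return correct_mass
```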

Don't get me wrong, I know there are other ways to see the logits, starting from various test scripts or frontends (such as the "server" in llama.cpp) and ending with raw calls to the model from code. But I just LOVE KOBOLDCPP SO MUCH that I found everything else inconvenient and complicated to use (or to transfer my settings/models to)! And it is not like "okay, so I want to compare logits in this specifically crafted prompt, let me find the right tool" – it is more like "oh, it would be really great to see the probabilities here, right at this moment of my large story…"

Since this change is complicated and affects many parts of the existing codebase (and interferes with the most important UI elements), you should consider implementing it only if you truly want to: if you like the idea and see it as actually useful both for yourself and for your casual users! (This should not be done in a hacky manner, otherwise it might complicate maintaining the code later.)

Related (?):

LostRuins commented 3 months ago

Actually I do want to add proper logprobs to the API, not just for one token, I just haven't found the time to get around to it.

aleksusklim commented 3 months ago

If you do the API side, I can try to hack together a userscript that adds my proposed UI to Lite, so we can at least see how it might look, without incorporating everything properly into the Lite source for now.

Will the token probabilities be correctly exposed when

LostRuins commented 3 months ago

The way I'm imagining it to be on API side would be essentially the same as how logprobs are sent over the OpenAI API (per token).
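For reference, the per-token logprobs structure in the OpenAI chat completions API looks roughly like this (reconstructed from memory, shown as a Python literal; treat the exact field names as approximate):

```python
# Approximate OpenAI-style per-token logprobs (chat completions), for illustration.
# Each generated token gets its own entry with its logprob and the top alternatives.
openai_style_logprobs = {
    "content": [
        {
            "token": " Paris",
            "logprob": -0.12,
            "top_logprobs": [
                {"token": " Paris", "logprob": -0.12},
                {"token": " Lyon",  "logprob": -3.40},
                {"token": " the",   "logprob": -4.75},
            ],
        },
        # ...one entry per generated token...
    ]
}
```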

For Lite, it will probably be a separate panel that opens a table with two columns: one token per row in the first column, and the top 5 logprobs with their text in the second column.

Btw for point number 4, ensure that top-k is 0 or > 5, and top-p is set to 1.0 with other truncation samplers disabled.
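For instance, a request with truncation effectively disabled could look roughly like this (the endpoint and field names follow the KoboldAI-style API that koboldcpp exposes; this is just a sketch, double-check them against your version):

```python
import requests

# Sketch: generate with truncation samplers effectively disabled, so that
# low-probability candidates are not cut off before they can be reported.
payload = {
    "prompt": "The capital of France is",
    "max_length": 1,     # one token is enough to inspect the next-token distribution
    "temperature": 1.0,
    "top_k": 0,          # 0 = top-k truncation disabled
    "top_p": 1.0,        # 1.0 = nucleus truncation disabled
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json())
```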

aleksusklim commented 3 months ago

I was thinking more about the actual post-sampling probabilities than raw logprobs. Meaning, top_k=1 will essentially output greedy tokens, each being 100% the only candidate there (as Debug mode currently shows). I see some valuable information here, like judging the direct influence of the current sampler parameters (especially the temperature itself).

On the other hand, raw logits might also be interesting… I'm not sure whether one is better than the other, but anyway – if you need to see everything, you can raise the sampler limits, yes. The only downside is that you cannot trigger greedy sampling if you allow anything other than 100% in the outputs, but if we call that "manual sampling" – I think this is not a problem, since you would still be able to tap Home+Enter+Home+Enter… to choose the first candidate each time (as I imagine it).

But to be compatible with the OpenAI API – you have to return logprobs…
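To make the distinction concrete – a minimal sketch, assuming direct access to the raw logits of a single step (the fixed top-k-then-temperature order below is just for illustration; koboldcpp's real sampler order is configurable):

```python
import math

def candidate_views(logits, temperature=0.8, top_k=40):
    """Contrast raw logprobs with the post-sampling ("factual") probabilities.

    `logits` is a hypothetical {token: raw_logit} dict for one position.
    Raw logprobs come from a plain softmax over all logits; the probabilities
    the sampler actually draws from are renormalized after truncation (top-k
    here) and temperature are applied.
    """
    # Raw logprobs: log-softmax over the untouched logits.
    m = max(logits.values())
    z = sum(math.exp(v - m) for v in logits.values())
    raw_logprobs = {t: (v - m) - math.log(z) for t, v in logits.items()}

    # Post-sampling probabilities: keep the top-k logits, apply temperature,
    # then renormalize over the survivors.
    kept = dict(sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
    m2 = max(kept.values())
    zk = sum(math.exp((v - m2) / temperature) for v in kept.values())
    sampled_probs = {t: math.exp((v - m2) / temperature) / zk for t, v in kept.items()}

    return raw_logprobs, sampled_probs
```

With top_k=1 the second view collapses to a single token at 100%, which is exactly the "always 100%" picture Debug mode shows under greedy-like settings, while the raw logprobs would still rank every alternative.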

aleksusklim commented 3 months ago

> one token per row for the first column, and the top 5 logprobs with their text in the second column

Without stacked tables with a ~unlimited (configurable) number of candidate tokens for each one, this would be the same as just looking at the Debug mode output. There would be no way to see it dynamically: how the previously chosen tokens interactively change the following probabilities.

That's why I said we need to try this to see whether it is really as useful as I imagine it to be! To start with, koboldcpp should return probabilities to Lite, given a flag and the limits I explained. Hmm, even the streaming and aborting behavior is not important at first!