This may be possible, but exposing 52,000 new logit float values over HTTP on every single token may be slow and suboptimal. I'd have to see if there's a better way to do it.
I am not familiar with the architecture, just brainstorming:
Maybe at this point it would be easier and more efficient to implement some kind of plug-in architecture, where the server can call external code from a plugin to process the data?
The plugin would have a name and a list of parameters. The frontend can then request "run MyAwesomeSamplingPlugin with parameters { paramA: 2.0, paramB: 1.3 }", and if such a plugin is loaded, the server obliges.
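For concreteness, the request in that scheme might look something like the sketch below; the endpoint path is made up purely to illustrate the shape (the plugin name and parameters are the ones from the example above):

```python
# Minimal sketch of the plugin-call idea. The /api/extra/plugin endpoint is
# hypothetical, not an existing koboldcpp API.
import requests

resp = requests.post("http://localhost:5001/api/extra/plugin", json={
    "plugin": "MyAwesomeSamplingPlugin",
    "params": {"paramA": 2.0, "paramB": 1.3},
})
resp.raise_for_status()  # the server would reject the call if no such plugin is loaded
```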
Adding to the thoughts above: half the reason I use the API at all is to use faster prototyping languages like JavaScript or Python to quickly stand up a wrapper API of my own and/or run some experiments. If I could get better access to things like logits, embeddings, and network layer state via external language bindings, I would also be happy. The main reason for suggesting the API route was the convenience of writing code outside of statically linked llama.cpp in C/C++. Language bindings are probably outside the scope of this repo, though.
Consider adding to the API a request parameter (or a separate endpoint) that skips sampling and returns the list of single next-token probabilities.
Requests for this unsampled output type could omit all of the sampling parameters (top-p, top-k, etc.) and instead take a limit parameter (defaulting to, say, 100 or 1000, with the option to remove the limit entirely).
Output for this request type may or may not include the token index, and might look something like:
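As a rough illustration (the endpoint path and field names below are hypothetical, not anything the server currently exposes):

```python
# Sketch of the proposed unsampled-output request/response shape. The
# /api/extra/logits endpoint and the response fields are hypothetical.
import requests

resp = requests.post("http://localhost:5001/api/extra/logits", json={
    "prompt": "The capital of France is",
    "limit": 100,  # proposed parameter: how many top candidates to return
}).json()

# Imagined response shape:
# {"candidates": [{"token_id": 3681, "token": " Paris", "prob": 0.87}, ...]}
for c in resp["candidates"]:
    print(c["token_id"], repr(c["token"]), c["prob"])
```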
This is an enabler for applications that fall into these categories:
- Applications that want very specific sampling rules, such as the OpenAI-like function-calling ideas thrown around in https://news.ycombinator.com/item?id=36331206 (e.g. pick the most probable next token that doesn't make the output invalid JSON). Another example: a rule where an answer is required, but the EOS token is allowed to stop generation afterwards. EOS must be unbanned in general, but cherry-picked OUT of the candidate list for the first call, so we want the most probable token that isn't EOS, then allow EOS after that. In both examples, if I just get all the tokens, I can write this custom logic and sample myself (see the sketch after this list).
- Research or data-science applications that actually want to do work on, or visualize, the probability distribution itself.
- Applications that want to use unconventional sampling methods, or tweak the ones baked into koboldcpp. For example, this is a potential solution to https://github.com/LostRuins/koboldcpp/issues/235, since that user could just fetch the logits and sample themselves. Research applications with esoteric sampling methods also fall heavily into this category.
This would also pre-empt all future issues asking for more sampling methods, as it allows the user to make their own if they wish.
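For instance, assuming the unsampled-output endpoint sketched above existed, the "no EOS on the first call" rule could be implemented client-side in a few lines. Everything here (endpoint path, field names) is hypothetical, and the EOS token id depends on the model's vocabulary:

```python
# Client-side sketch of the "most probable token that isn't EOS on the first
# call" rule. The /api/extra/logits endpoint and its fields are hypothetical.
import requests

EOS_TOKEN_ID = 2  # model-dependent, e.g. 2 for LLaMA-family vocabularies
API = "http://localhost:5001/api/extra/logits"  # hypothetical endpoint

def pick_next_token(prompt: str, first_call: bool) -> dict:
    candidates = requests.post(API, json={"prompt": prompt, "limit": 100}).json()["candidates"]
    candidates.sort(key=lambda c: c["prob"], reverse=True)  # most probable first
    for c in candidates:
        if first_call and c["token_id"] == EOS_TOKEN_ID:
            continue  # cherry-pick EOS out of the list on the first call only
        return c  # first candidate that passes the custom rule
    raise RuntimeError("no candidate passed the filter")
```

A generation loop would call this with first_call=True once, append the chosen token to the prompt, and pass first_call=False thereafter, allowing EOS to end the output naturally.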