Garstig opened this issue 4 months ago
Hey @Garstig, this is actually something I'm planning to implement for grammars. I believe the correct mechanism to implement this is to use speculative decoding to suggest the predefined tokens you mention here.
You should check out the API for that in llama/llama_speculative.py; you should be able to implement what you're trying to do with that interface.
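Roughly, something along these lines should work. This is a sketch only, assuming the `LlamaDraftModel` interface from `llama_cpp/llama_speculative.py` (a draft model receives the token ids seen so far and returns candidate tokens that the real model verifies in a single forward pass); `TemplateDraftModel` and its bookkeeping are made-up names, not an existing class:

```python
# Sketch, not a finished implementation. Assumes the LlamaDraftModel
# interface from llama_cpp/llama_speculative.py: __call__ gets the token
# ids so far and returns draft tokens verified in one forward pass.
import numpy as np
import numpy.typing as npt

from llama_cpp.llama_speculative import LlamaDraftModel


class TemplateDraftModel(LlamaDraftModel):  # hypothetical class
    """Drafts the fixed JSON-structure tokens so only the values
    need real sampling."""

    def __init__(self, predefined_tokens: list, prompt_len: int):
        # predefined_tokens: ints (fixed tokens) mixed with types (value slots)
        self.predefined_tokens = predefined_tokens
        self.prompt_len = prompt_len  # number of tokens in the prompt itself

    def __call__(self, input_ids: npt.NDArray[np.intc], /, **kwargs) -> npt.NDArray[np.intc]:
        # Position inside the template; for simplicity this assumes each
        # value slot produced exactly one token so far.
        pos = len(input_ids) - self.prompt_len
        draft: list[int] = []
        for item in self.predefined_tokens[pos:]:
            if not isinstance(item, int):
                break  # value slot: let the real model generate here
            draft.append(item)
        return np.array(draft, dtype=np.intc)


# Usage (draft_model is an existing Llama() argument):
# llm = Llama(model_path="...", draft_model=TemplateDraftModel(tokens, n_prompt))
```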
Hi @abetlen!
Thanks for your response! I hope I can check out the solution you suggested this week :)
Problem
I need to create a lot of small JSONs with an LLM. To do so I started with Jsonformer. However, since Jsonformer is not maintained anymore and my colleagues use this library, I wanted to switch.
In a test I realized that Jsonformer is 2-3 times as fast at creating a JSON with a single boolean value.
I looked into the code and realized that Jsonformer only generates the values of the JSON via LLM inference, since the rest of the output is already defined by the given response_format. llama-cpp-python doesn't do this.
My idea for a solution
Disclaimer: In the end it might be better to solve this in llama.cpp. Since I'm not proficient in C++, I thought I'd suggest it here first, where I have some understanding.
First, we transform the response_format into a list of token ids / types.
For example, we want dicts that look like this:
{'born_in_Germany': bool}
This could be transformed to:
predefined_tokens = [1, 12012, 6363, 28730, 262, 28730, 28777, 858, 1164, 1869, 28705, <class 'bool'>, 1, 443]
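A hedged sketch of how this transformation could look; `build_predefined_tokens` is a made-up helper that tokenizes the fixed text around each value and keeps the value's type as a placeholder:

```python
# Sketch under the assumption that the fixed JSON text can be tokenized
# piecewise with Llama.tokenize(); in practice the pieces may tokenize
# differently than the full string, so the boundaries need care.
from llama_cpp import Llama


def build_predefined_tokens(llm: Llama, template: dict) -> list:  # hypothetical helper
    tokens: list = []
    text = "{"
    for i, (key, value_type) in enumerate(template.items()):
        text += (", " if i > 0 else "") + f"'{key}': "
        tokens.extend(llm.tokenize(text.encode("utf-8"), add_bos=(i == 0)))
        tokens.append(value_type)  # placeholder: the model fills this slot
        text = ""
    tokens.extend(llm.tokenize(b"}", add_bos=False))
    return tokens


# build_predefined_tokens(llm, {"born_in_Germany": bool})
# -> [1, 12012, ..., 28705, <class 'bool'>, 443]  (ids depend on the tokenizer)
```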
Now we can iterate through the list. If we get a token id, we skip the inference. If we get a type, we ask the model for a value.
The complicated part would be to determine whether the model is done generating the current value, but we could copy the logic from Jsonformer to achieve that.
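A minimal sketch of that loop using the low-level `Llama.eval()`/`Llama.sample()` calls. Note that the predefined tokens still have to go through `eval()` so the context (KV cache) stays consistent; only the sampling step is skipped:

```python
# Sketch only: assumes predefined_tokens mixes ints (fixed tokens) with
# types (value slots), and that a bool value fits in a single token.
# Key point: skipped tokens are still eval()'d, just never sampled.
from llama_cpp import Llama


def generate_json_tokens(llm: Llama, prompt_tokens: list, predefined_tokens: list) -> list:
    llm.reset()
    llm.eval(prompt_tokens)  # process the prompt once
    output: list[int] = []
    for item in predefined_tokens:
        if isinstance(item, int):
            # Fixed structure token: no sampling needed, but the model
            # must still see it, otherwise later values are computed
            # from the wrong context.
            llm.eval([item])
            output.append(item)
        else:
            # Value slot: sample a token from the current context.
            # A real implementation would loop here until the value of
            # type `item` is complete (the Jsonformer-style stop logic).
            tok = llm.sample()
            llm.eval([tok])
            output.append(tok)
    return output
```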
My first monkeypatch test
In the eval method in llama.py I added a skip_token. Its value is the next token_id if it is already predefined by the result_template; if it is None, we want to run the model inference.
This code runs; however, the outputs of the LLM seem to be random. I tested it with a list of famous people, where the LLM should decide whether each person was born in Germany. After my modifications it made a lot of mistakes; before, it had an accuracy of 100 %.
My guess is that the LLM does not run with the correct input, but it is hard to validate that. To be honest, I do not completely understand all of the variables and got kinda lost. It would be awesome to get some help! If you need more of the code to test things, I can provide it. Right now I think it would confuse more people than it would help, as I probably have a logic mistake rather than a bug in the code.
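One way to check that guess without sharing the whole patch could be to track every token that actually reaches eval() and detokenize it afterwards; `eval_and_track` below is just a made-up wrapper:

```python
# Debugging sketch: record everything that reaches eval() so the
# effective input can be inspected as text afterwards.
fed_tokens: list[int] = []


def eval_and_track(llm, tokens):  # made-up wrapper around Llama.eval
    fed_tokens.extend(tokens)
    llm.eval(tokens)


# ... run the modified generation using eval_and_track(), then:
# print(llm.detokenize(fed_tokens).decode("utf-8", errors="replace"))
```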