abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

[Implement Optimization] Skip Inference for Predefined Tokens in Response Formatting #1203

Open Garstig opened 4 months ago

Garstig commented 4 months ago

Problem

I need to create a lot of small JSONs with an LLM. To do so I started with Jsonformer. However, since it is not maintained anymore and my colleagues use this library, I wanted to switch.

In a test I realized that Jsonformer is 2-3 times as fast at creating a JSON with a single boolean value.

I looked into the code and realized that Jsonformer only asks the LLM for the value tokens of the JSON; the rest of the output is already fixed by the given response_format and is emitted without inference. llama-cpp-python doesn't do this.

My idea for a solution

Disclaimer: in the end it might be better to solve this in llama.cpp itself. Since I'm not proficient in C++, I thought I'd suggest it here first, where I have some understanding.

First, we transform the response_format into a list of token_ids / types.

For example, we want dicts that look like this:

{'born_in_Germany': bool}.

This could be transformed to:

predefined_tokens = [1, 12012, 6363, 28730, 262, 28730, 28777, 858, 1164, 1869, 28705, <class 'bool'>, 1, 443]

Now we can iterate through the list. If we get a token id, we skip the inference. If we get a type, we ask the model for a value.

The complicated part would be determining when the model is done generating the current value, but we could copy the logic from Jsonformer to achieve that.
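
To make the idea a bit more concrete, here is a rough sketch of the control flow. Everything in it (build_template, fill_template, the dummy generate_value callback and the token ids) is made up for illustration and is not an existing llama-cpp-python API.

    from typing import Callable, List, Union

    # A template is a flat list of predefined token ids, with Python types
    # marking the slots where the model actually has to generate a value.
    Template = List[Union[int, type]]

    def build_template(prefix: List[int], value_type: type, suffix: List[int]) -> Template:
        # e.g. the tokens for '{"born_in_Germany": ' + bool + the tokens for '}'
        return [*prefix, value_type, *suffix]

    def fill_template(
        template: Template,
        generate_value: Callable[[type, List[int]], List[int]],
    ) -> List[int]:
        """Walk the template: copy predefined token ids verbatim (no inference),
        and only call the model where a value type is encountered."""
        output: List[int] = []
        for item in template:
            if isinstance(item, int):
                output.append(item)  # predefined token: skip inference
            else:
                output.extend(generate_value(item, output))  # ask the model for a value
        return output

    # Dummy usage: a fake "model" that answers every boolean slot with token id 999.
    template = build_template([1, 12012, 6363, 28730, 262], bool, [1, 443])
    print(fill_template(template, lambda value_type, context: [999]))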

My first monkey-patch test

In eval in llama.py I added a skip_token parameter. Its value is the next token_id if that token is already predefined by the result template; if it is None, we run the model inference as usual.

    def eval(self, tokens: Sequence[int], skip_token = None):
        """Evaluate a list of tokens.

        Args:
            tokens: The list of tokens to evaluate.
            skip_token: If skip_token is set, we skip the inference and just use this token_id as prediction.
        """
        assert self._ctx.ctx is not None
        assert self._batch.batch is not None
        self._ctx.kv_cache_seq_rm(-1, self.n_tokens, -1)
        if skip_token is not None:
            # I modified all the variables here instead of reusing the loop below, as I tried a lot of different values.
            batch = tokens[0: min(len(tokens), 0 + self.n_batch)]
            n_past = self.n_tokens
            n_tokens = len(batch)
            self._batch.set_batch(
                batch=batch, n_past=n_past, logits_all=self.context_params.logits_all
            )
            self.input_ids[n_past : n_past + n_tokens] = batch
            self.n_tokens += n_tokens

            offset = (
                0 if self.context_params.logits_all else n_tokens - 1
            ) 

            logits_tmp = np.zeros(32000)  # fake logit output of the LLM: everything is 0
            logits_tmp[skip_token] = 1    # except the token that was given as an argument
            self.scores[n_past + offset : n_past + n_tokens, :].reshape(-1)[
                :
            ] = logits_tmp
            return
        for i in range(0, len(tokens), self.n_batch):
            batch = tokens[i : min(len(tokens), i + self.n_batch)]
            n_past = self.n_tokens
            n_tokens = len(batch)
            self._batch.set_batch(
                batch=batch, n_past=n_past, logits_all=self.context_params.logits_all
            )
            self._ctx.decode(self._batch)
            # Save tokens
            self.input_ids[n_past : n_past + n_tokens] = batch
            # Save logits
            rows = n_tokens
            cols = self._n_vocab
            offset = (
                0 if self.context_params.logits_all else n_tokens - 1
            )  # NOTE: Only save the last token logits if logits_all is False
            self.scores[n_past + offset : n_past + n_tokens, :].reshape(-1)[
                :
            ] = self._ctx.get_logits()[offset * cols : rows * cols]
            # Update n_tokens
            self.n_tokens += n_tokens

This code runs, however the outputs of the LLM seem to be random. I tested it with a list of famous people where the LLM should decide whether the person was born in Germany. After my modifications it made a lot of mistakes; before, it had an accuracy of 100 %.

My guess is that the LLM does not run with the correct input, but that is hard to validate. To be honest, I do not completely understand all of the variables and got kind of lost. It would be awesome to get some help! If you need more of the code to test things, I can provide it. Right now I think it would confuse more people than it would help, as I probably have a logic mistake rather than a bug in the code.
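
If that guess is right, one suspect in the patch above is that the skip branch never calls self._ctx.decode, so the skipped tokens are never fed through the model and never end up in the KV cache; later predictions would then be made on a context that is missing them. An untested variant that keeps the decode and only overwrites the stored logits might look like this:

    # Untested sketch: still run the forward pass so the skipped tokens reach the
    # KV cache, but overwrite the stored logits with a one-hot vector so the
    # sampler is forced to pick skip_token.
    if skip_token is not None:
        batch = tokens[0 : min(len(tokens), self.n_batch)]
        n_past = self.n_tokens
        n_tokens = len(batch)
        self._batch.set_batch(
            batch=batch, n_past=n_past, logits_all=self.context_params.logits_all
        )
        self._ctx.decode(self._batch)  # keep the decode, unlike the patch above
        self.input_ids[n_past : n_past + n_tokens] = batch
        offset = 0 if self.context_params.logits_all else n_tokens - 1
        logits_tmp = np.zeros(self._n_vocab)  # one-hot "logits" for the predefined token
        logits_tmp[skip_token] = 1.0
        self.scores[n_past + offset : n_past + n_tokens, :] = logits_tmp
        self.n_tokens += n_tokens
        return

Of course this still pays for the forward pass over the predefined tokens, so at best it fixes correctness rather than speed.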

abetlen commented 4 months ago

Hey @Garstig, this is actually something I'm planning to implement for grammars. I believe the correct mechanism is to use speculative decoding to suggest the predefined tokens you mention here.

You should check out the API for that in llama/llama_speculative.py; you should be able to implement what you're trying to do with that interface.
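
Roughly, the idea would be a small draft model that just proposes the predefined tokens, something like the sketch below. This assumes the LlamaDraftModel interface there (a callable that receives the current input_ids array and returns an array of draft token ids, passed to Llama via draft_model=...); the matching logic is naive and only meant as an illustration, not a tested implementation.

    # Untested sketch of a draft model that proposes a fixed token template.
    import numpy as np
    import numpy.typing as npt

    from llama_cpp.llama_speculative import LlamaDraftModel


    class PredefinedTokenDraftModel(LlamaDraftModel):
        """Draft the remaining tokens of a fixed response template."""

        def __init__(self, predefined_tokens, num_pred_tokens: int = 10):
            self.predefined_tokens = list(predefined_tokens)
            self.num_pred_tokens = num_pred_tokens

        def __call__(self, input_ids: npt.NDArray[np.intc], /, **kwargs) -> npt.NDArray[np.intc]:
            # Naive alignment: find the longest prefix of the template that matches
            # the tail of input_ids, then draft the tokens that come after it.
            for consumed in range(len(self.predefined_tokens), 0, -1):
                if list(input_ids[-consumed:]) == self.predefined_tokens[:consumed]:
                    rest = self.predefined_tokens[consumed : consumed + self.num_pred_tokens]
                    return np.asarray(rest, dtype=np.intc)
            return np.asarray([], dtype=np.intc)  # no match: fall back to normal decoding

    # Hypothetical usage:
    # llm = Llama(model_path="model.gguf",
    #             draft_model=PredefinedTokenDraftModel([1, 12012, 6363, 28730, 262, 1, 443]))

The drafted tokens are still verified by the main model, so a wrong draft only costs a little extra work and doesn't change the output.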

Garstig commented 4 months ago

Hi @abetlen!

Thanks for your response! I hope I can check out your provided solution this week :)

Garstig commented 3 months ago

Hi,

a little late, but I finally got to test it out. Sadly I do not see any speed gains; it seems to get even slower.

Did I do something wrong? Here is a short gist that should run on Google Colab without modification.