br3no opened 6 months ago
Hi @br3no, I have a couple of questions on this issue. Can you please share more detail on these?
ff-tokens are fast-forward tokens. When you are generating guided output, e.g. a json object, there are moments when you don't really need an LLM to generate the next tokens, because the next tokens are specified by the guide. This reduces the load on the GPU and is generally much faster, as you only need to traverse the state-machine.
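As a rough sketch of the idea (the FSM representation here is hypothetical, not outlines' actual data structures): whenever the guide's current state has exactly one outgoing transition, the next token is forced and can be emitted without calling the model.

```python
def fast_forward(fsm, state):
    """Collect tokens forced by the guide.

    `fsm` is a hypothetical mapping: state -> {token_id: next_state}.
    While the current state has exactly one outgoing transition, the
    next token is fully determined, so no LLM forward pass is needed.
    """
    tokens = []
    while len(fsm.get(state, {})) == 1:
        ((token, next_state),) = fsm[state].items()
        tokens.append(token)
        state = next_state
    return tokens, state
```

For example, with `fsm = {0: {10: 1}, 1: {11: 2}, 2: {12: 3, 13: 4}}`, starting from state 0 the guide fast-forwards tokens 10 and 11 and stops at state 2, where the LLM must choose between 12 and 13.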
`Write` and `Generate` are instructions. A `Generate` instruction signals that the next step in the sequence requires an LLM generation; its `tokens` member variable contains the valid next tokens in the sequence, according to the guide (the state machine). A `Write` instruction signals that the next step(s) in the sequence does not require an LLM generation; its `tokens` member variable then contains the next tokens in the sequence.
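A minimal sketch of how a consumer of the guide might act on the two instruction types; the class and function names here are illustrative stand-ins, not outlines' actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Write:
    tokens: List[int]  # the exact next tokens, fixed by the guide


@dataclass
class Generate:
    tokens: List[int]  # the tokens the LLM may choose from


def advance(instruction, sequence: List[int],
            sample: Callable[[List[int]], int]) -> List[int]:
    """Apply one guide instruction to the running token sequence."""
    if isinstance(instruction, Write):
        # Fast-forward: append all guide-determined tokens, no LLM call.
        return sequence + instruction.tokens
    # Generate: the LLM must pick exactly one of the allowed tokens.
    return sequence + [sample(instruction.tokens)]
```

So `advance(Write([7, 8, 9]), [1], sample)` appends all three tokens without touching the model, while `advance(Generate([5, 6]), [1], sample)` appends only the single token the model samples from the allowed set.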
Thank you @br3no ! Much appreciated!
@br3no are there any new developments for fast-forward / accelerate?
also curious about the state of things here.
@simon-mo @rlouf do you know the latest on this?
Describe the issue as clearly as possible:
See: https://github.com/outlines-dev/outlines/blob/d6a2b7908065d420456118723f69908c4094c1f8/outlines/integrations/vllm.py#L110
Here the `tokens` field of the next instruction is treated equally regardless of whether it is of type `Generate` or `Write`. If a `Write` instruction has a `tokens` field with length > 1, this means we will accept any of the next ff-tokens as the token in the next step. This is incorrect.

Steps/code to reproduce the bug:
Expected result:
Error message:
No response
Outlines/Python version information:
Context for the issue:
Bug was discussed in a call with @rlouf.