Closed NeonBohdan closed 5 months ago
I think this feature is most useful for models using static prompts (a.k.a. system prompts). See for example the system prompt used by StableLM: https://github.com/Stability-AI/StableLM#quickstart. In this case, reusing the model state after the system prompt is very useful because it will never change. It's also easy to make it compatible with batch generation since the prompt length is fixed.
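To illustrate the idea (a toy sketch, not this project's actual code): in a causal model, each token's state depends on the whole prefix, so the states for a fixed system prompt can be computed once and reused for every request that starts with that prompt. The function and class names below are hypothetical.

```python
import hashlib

def token_state(prefix, token):
    # Stand-in for a real forward pass: each token's state depends on
    # the token itself and everything before it (causal attention).
    return hashlib.sha256((" ".join(prefix) + "|" + token).encode()).hexdigest()

class PrefixCache:
    """Cache the per-token states of a fixed prefix (e.g. a system prompt)."""
    def __init__(self, prompt_tokens):
        self.prompt_tokens = list(prompt_tokens)
        # Computed once; reused for every request that shares this prompt.
        self.states = [
            token_state(self.prompt_tokens[:i], tok)
            for i, tok in enumerate(self.prompt_tokens)
        ]

def encode(tokens, cache=None):
    """Compute states for `tokens`, reusing cached prompt states if given."""
    states, start = [], 0
    if cache is not None and tokens[:len(cache.prompt_tokens)] == cache.prompt_tokens:
        states.extend(cache.states)      # skip recomputing the prompt
        start = len(cache.prompt_tokens)
    for i in range(start, len(tokens)):
        states.append(token_state(tokens[:i], tokens[i]))
    return states

system = ["<sys>", "You", "are", "helpful", "</sys>"]
cache = PrefixCache(system)
full = encode(system + ["Hello"])           # computes all 6 states from scratch
fast = encode(system + ["Hello"], cache)    # recomputes only the final state
assert full == fast
```

Because the prompt length is fixed, the cached states line up at the same positions for every request, which is what makes this compatible with batching.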
On the other hand, the context you are referring to has a fixed size (e.g. 2048 tokens). Once the maximum size is reached, you need to remove tokens at the beginning of the context (rolling context), and the cached model state should be invalidated and fully recomputed. Here there is less benefit, especially if the context length is small.
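To make the invalidation point concrete: every cached state depends on the tokens before it, so dropping the oldest token changes the prefix of every surviving position. A toy sketch (hypothetical names):

```python
MAX_CTX = 4

def roll(context, new_token, max_ctx=MAX_CTX):
    """Append a token to the context window.

    Returns (new_context, cache_still_valid).
    """
    context = context + [new_token]
    if len(context) <= max_ctx:
        return context, True    # cached states for earlier tokens still hold
    # The window slid: every surviving token now has a different prefix,
    # so any cached per-token state must be recomputed from scratch.
    return context[-max_ctx:], False

ctx = ["a", "b", "c"]
ctx, ok = roll(ctx, "d")
assert ok and ctx == ["a", "b", "c", "d"]
ctx, ok = roll(ctx, "e")
assert not ok and ctx == ["b", "c", "d", "e"]
```

This is why a rolling context caps the benefit: the cache pays off only between invalidations, which happen on every step once the window is full.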
So I will probably prioritize implementing something for static prompts. In the meantime, feel free to post additional insights or examples of this feature in other projects.
DialoGPT reads a user input and then generates text until an EOS token appears. But for DialoGPT, EOS can appear multiple times, and after every response, to keep the context, you need to feed a longer and longer input each time, repeating the same calculations.
Is there a way to output the model state after generation is complete, so that next time I can provide the state to the model instead of feeding the long text again?
It's called stateful inference; it would be very useful for requests with repeating context.
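The requested API shape can be sketched like this (all names hypothetical, not this project's actual API): `generate` returns an opaque state alongside the reply, and passing that state back on the next call skips re-encoding the shared history. For DialoGPT under Hugging Face transformers, the analogous mechanism is the `past_key_values` returned when calling the model with `use_cache=True`.

```python
class StatefulModel:
    """Toy stand-in for stateful inference: `state` records how many
    tokens have already been processed, so repeated context is skipped."""
    def __init__(self):
        self.forward_calls = 0           # counts per-token "forward passes"

    def _forward(self, token):
        self.forward_calls += 1
        return token.upper()             # stand-in for a real model step

    def generate(self, tokens, state=None):
        processed = 0 if state is None else state["n"]
        # Only tokens beyond the cached state are actually processed.
        for tok in tokens[processed:]:
            self._forward(tok)
        reply = ["ok", "<eos>"]          # pretend generation up to EOS
        for tok in reply:
            self._forward(tok)
        new_state = {"n": len(tokens) + len(reply)}
        return reply, new_state

model = StatefulModel()
history = ["hi"]
reply, state = model.generate(history)           # 3 forward calls: 1 input + 2 generated
history += reply + ["how", "are", "you"]
calls_before = model.forward_calls
reply, state = model.generate(history, state)    # only the new tokens are processed
assert model.forward_calls - calls_before == 5   # 3 new input + 2 generated
```

Without the state, the second call would reprocess all 6 history tokens instead of 3; over many turns the savings grow with the conversation length.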