OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Resume model execution from where it stopped #1210

Closed NeonBohdan closed 5 months ago

NeonBohdan commented 1 year ago

DialoGPT reads a user input and then generates text until an EOS token appears. But for DialoGPT, EOS can appear multiple times, and after every response, to keep the context you need to feed a longer and longer input each time, repeating the same calculations.

Is there a way to output the model state after generation is completed? Then next time I could provide that state to the model instead of feeding the long text again.

It's called stateful inference; it would be very useful for requests with repeating context. The sketch below shows the current loop.
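
To make the problem concrete, here is a minimal sketch of the turn-by-turn loop as it works today, assuming `dialogpt_ct2` is a local directory produced by `ct2-transformers-converter`; the `ctranslate2` and `transformers` calls are the standard public APIs:

```python
import ctranslate2
import transformers

# "dialogpt_ct2" is an assumed local conversion of microsoft/DialoGPT-medium.
tokenizer = transformers.AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
generator = ctranslate2.Generator("dialogpt_ct2")

history = []  # token strings for the whole conversation so far

for user_input in ["Hello!", "How are you?"]:
    # Append the new utterance, then re-feed the FULL history: the model
    # recomputes attention over the same prefix on every turn.
    history += tokenizer.convert_ids_to_tokens(
        tokenizer.encode(user_input + tokenizer.eos_token)
    )
    result = generator.generate_batch(
        [history],
        max_length=64,
        include_prompt_in_result=False,
    )[0]
    reply = result.sequences[0]
    print(tokenizer.convert_tokens_to_string(reply))
    history += reply  # the input grows longer after every response
```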

guillaumekln commented 1 year ago

I think this feature is most useful for models using static prompts (a.k.a. system prompts). See for example the system prompt used by StableLM: https://github.com/Stability-AI/StableLM#quickstart. In this case, reusing the model state after the system prompt is very useful because it will never change. It's also easy to make it compatible with batch generation since the prompt length is fixed.
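
As a sketch, an interface for this might look like the following. The `static_prompt` argument shown here is hypothetical, not an existing API at the time of writing, and the model paths are illustrative:

```python
import ctranslate2
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-7b")
generator = ctranslate2.Generator("stablelm_ct2")  # assumed local conversion

def tokens(text):
    return tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

# The system prompt is identical for every request, so its model state
# could be computed once and cached, since its length never changes.
system_prompt = tokens("<|SYSTEM|>You are a helpful assistant.")  # illustrative prompt

result = generator.generate_batch(
    [tokens("<|USER|>Write a haiku about the sea.<|ASSISTANT|>")],
    static_prompt=system_prompt,  # hypothetical: reuse cached fixed prefix
    max_length=128,
)[0]
print(tokenizer.convert_tokens_to_string(result.sequences[0]))
```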

On the other hand, the context you are referring to has a fixed maximum size (e.g. 2048 tokens). Once the maximum size is reached, you need to remove tokens at the beginning of the context (rolling context), and the cached model state should be invalidated and recomputed fully. Here there is less benefit, especially if the context length is small.
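
A minimal sketch of why rolling the context invalidates the cache (an illustrative helper, not part of CTranslate2):

```python
def roll_context(tokens, max_context=2048):
    """Illustrative helper: trim the oldest tokens once the maximum
    context size is exceeded.

    Dropping tokens at the *beginning* shifts every remaining position,
    so any per-position cached model state no longer lines up and would
    have to be recomputed from scratch.
    """
    if len(tokens) <= max_context:
        return tokens, False            # cached state still valid
    return tokens[-max_context:], True  # cache must be invalidated
```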

So I will probably prioritize implementing something for static prompts. In the meantime, feel free to post additional insights or examples of this feature in other projects.