OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Resume model execution from where it stopped #1210

Closed NeonBohdan closed 5 months ago

NeonBohdan commented 1 year ago

DialoGPT reads a user input and then generates text until an EOS token appears. But for DialoGPT, EOS can appear multiple times, and after every response, to keep the context you need to feed a longer and longer input each time, repeating the same calculations.

Is there a way to output the model state after generation is completed? Then next time I could provide that state to the model instead of feeding the long text again.

It's called stateful inference; it would be very useful for requests with repeating context. The sketch below shows the current loop.
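
To make the problem concrete, here is a minimal sketch of the turn-by-turn loop as it works today, assuming `dialogpt_ct2` is a local directory produced by `ct2-transformers-converter`; the `ctranslate2` and `transformers` calls are the standard public APIs:

```python
import ctranslate2
import transformers

# "dialogpt_ct2" is an assumed local conversion of microsoft/DialoGPT-medium.
tokenizer = transformers.AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
generator = ctranslate2.Generator("dialogpt_ct2")

history = []  # token strings for the whole conversation so far

for user_input in ["Hello!", "How are you?"]:
    # Append the new utterance, then re-feed the FULL history: the model
    # recomputes attention over the same prefix on every turn.
    history += tokenizer.convert_ids_to_tokens(
        tokenizer.encode(user_input + tokenizer.eos_token)
    )
    result = generator.generate_batch(
        [history],
        max_length=64,
        include_prompt_in_result=False,
    )[0]
    reply = result.sequences[0]
    print(tokenizer.convert_tokens_to_string(reply))
    history += reply  # the input grows longer after every response
```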

guillaumekln commented 1 year ago

I think this feature is most useful for models using static prompts (a.k.a. system prompts). See for example the system prompt used by StableLM: https://github.com/Stability-AI/StableLM#quickstart. In this case, reusing the model state after the system prompt is very useful because it will never change. It's also easy to make it compatible with batch generation since the prompt length is fixed.
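
As a sketch, an interface for this might look like the following. The `static_prompt` argument shown here is hypothetical, not an existing API at the time of writing, and the model paths are illustrative:

```python
import ctranslate2
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("stabilityai/stablelm-tuned-alpha-7b")
generator = ctranslate2.Generator("stablelm_ct2")  # assumed local conversion

def tokens(text):
    return tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

# The system prompt is identical for every request, so its model state
# could be computed once and cached, since its length never changes.
system_prompt = tokens("<|SYSTEM|>You are a helpful assistant.")  # illustrative prompt

result = generator.generate_batch(
    [tokens("<|USER|>Write a haiku about the sea.<|ASSISTANT|>")],
    static_prompt=system_prompt,  # hypothetical: reuse cached fixed prefix
    max_length=128,
)[0]
print(tokenizer.convert_tokens_to_string(result.sequences[0]))
```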

On the other hand, the context you are referring to has a fixed maximum size (e.g. 2048 tokens). Once the maximum size is reached, you need to remove tokens at the beginning of the context (rolling context), and the cached model state should be invalidated and recomputed fully. Here there is less benefit, especially if the context length is small.
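
A minimal sketch of why rolling the context invalidates the cache (an illustrative helper, not part of CTranslate2):

```python
def roll_context(tokens, max_context=2048):
    """Illustrative helper: trim the oldest tokens once the maximum
    context size is exceeded.

    Dropping tokens at the *beginning* shifts every remaining position,
    so any per-position cached model state no longer lines up and would
    have to be recomputed from scratch.
    """
    if len(tokens) <= max_context:
        return tokens, False            # cached state still valid
    return tokens[-max_context:], True  # cache must be invalidated
```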

So I will probably prioritize implementing something for static prompts. In the meantime, feel free to post additional insights or examples of this feature in other projects.