twardoch opened this issue 3 months ago
Agreed this is something we should have--I can take this on next week if there hasn't been an attempt by then. There are a few things here:
```python
Prompt(..., content_compression_func, ...)

def content_compression_func(
    input_content: str,
    max_output_tokens: int,
    encode_func: Callable[[str], list[int]],
) -> str
```
The user would need to ensure that, when encoded, the output string is below `max_output_tokens`. I believe we need to quantify `max_output_tokens` to ensure we end up below the token limit in a context-overflow scenario.
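For concreteness, here is a minimal sketch of what a user-supplied `content_compression_func` matching that signature could look like, using plain prefix truncation; the binary-search approach is purely illustrative and not part of the existing Prompt Poet API:

```python
from typing import Callable


def content_compression_func(
    input_content: str,
    max_output_tokens: int,
    encode_func: Callable[[str], list[int]],
) -> str:
    """Illustrative compressor: return the longest prefix of input_content
    whose encoded length fits within max_output_tokens."""
    if len(encode_func(input_content)) <= max_output_tokens:
        return input_content

    # Binary search over prefix lengths for the largest one that still fits.
    # Assumes token count is (near-)monotonic in prefix length.
    lo, hi = 0, len(input_content)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if len(encode_func(input_content[:mid])) <= max_output_tokens:
            lo = mid
        else:
            hi = mid - 1
    return input_content[:lo]
```

Since this is a pure function of its inputs, the same `input_content` always maps to the same output, which matters for the caching discussion below.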
To get this running at scale, which is crucial for us, we then need to do the following.
First, we should ensure that `input_content` does not change every time we hit context overflow; this can be achieved with the truncation step buffer we have today. Then, we need to ensure that once a specific `input_content` has been mapped to a compressed state, that state does not change. If it changes on successive generation attempts, it will break our model server caching, which will break our serving economics.
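One way a caller could enforce that stability is to memoize the mapping, sketched below under the assumption that the underlying compression function is itself deterministic; the `make_stable_compressor` helper name is hypothetical:

```python
import functools
from typing import Callable

CompressionFunc = Callable[[str, int, Callable[[str], list[int]]], str]


def make_stable_compressor(
    compress: CompressionFunc,
    max_output_tokens: int,
    encode_func: Callable[[str], list[int]],
) -> Callable[[str], str]:
    """Memoize the input_content -> compressed-content mapping so that
    retries after context overflow reuse the exact same compressed string,
    keeping model-server prefix caches warm."""

    @functools.lru_cache(maxsize=4096)
    def compress_stable(input_content: str) -> str:
        return compress(input_content, max_output_tokens, encode_func)

    return compress_stable
```

A per-process cache is only a stopgap, though; the stronger guarantee is for the compression function itself to be deterministic.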
Another name for this function might be `context_overflow_func`, though I think `content_compression_func` is a bit more explicit as to the behavior of the function. Not very opinionated here, though.
Any update here?
Truncation is one way to solve context overflow. Another is summarization.
(Btw the magic keyword to trigger good summarization from any model is not asking it to do a summary but asking it to do a TLDR. Works much better, you don't need very verbose prompts.)
It would be great if Prompt Poet could allow for a custom hook function (or does it already?) that gets triggered when we hit context overflow.
In certain situations, I generally prefer to call a small local model or a cheap remote model with a TLDR prompt to compress older prompts (the chat history, for example) rather than truncating.
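To make that concrete, here is a rough sketch of a summarization-based hook in the shape of the signature proposed above; `cheap_llm` stands in for whatever small local or remote model you would call, and the truncation fallback is my own assumption to keep the result inside the token budget:

```python
from typing import Callable


def tldr_compression_func(
    input_content: str,
    max_output_tokens: int,
    encode_func: Callable[[str], list[int]],
    cheap_llm: Callable[[str], str],  # placeholder: any "prompt in, text out" model call
) -> str:
    """Compress overflowing content by asking a cheap model for a TLDR,
    falling back to plain truncation if the summary is still too long."""
    if len(encode_func(input_content)) <= max_output_tokens:
        return input_content

    summary = cheap_llm(f"TLDR of the following chat history:\n\n{input_content}")

    # The model may ignore the budget, so enforce it deterministically.
    while summary and len(encode_func(summary)) > max_output_tokens:
        summary = summary[:-100]  # crude character-level trim
    return summary
```

The extra `cheap_llm` argument could be bound with `functools.partial` before handing the function to `Prompt`, so it still matches the three-argument signature discussed above.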