dimagi / open-chat-studio

A web based platform for building Chatbots backed by Large Language Models
BSD 3-Clause "New" or "Revised" License

Expand token based compression to consider the whole input to the LLM #463

Closed SmittieC closed 2 months ago

SmittieC commented 3 months ago

There are a few of these errors:

Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens. However, your messages resulted in 8202 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

By default, we set the experiment's max token limit (under edit experiment -> safety) to 8192, which in the case of the above error is also the model's maximum context length. Since we only consider the chat history + summary when compressing, we hit a corner case when the history + summary comes close to that limit: the history is just one component of the final input to the LLM, so the other components (system prompt + user input + any other data we include) can push the total context length back over the model's maximum.
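
For illustration, a minimal sketch of how the current check can pass while the request still overflows. The component token counts are made up, chosen only so the total matches the 8202 tokens from the error above:

```python
# Hypothetical token counts for the failing request above
history_and_summary_tokens = 8100   # the only thing the compression step measures today
system_prompt_tokens = 70           # not counted today
user_input_tokens = 32              # not counted today

MODEL_CONTEXT_LIMIT = 8192          # also the experiment's default max token limit

# The compression check passes...
assert history_and_summary_tokens <= MODEL_CONTEXT_LIMIT

# ...but the full request does not fit:
total = history_and_summary_tokens + system_prompt_tokens + user_input_tokens
print(total)  # 8202 > 8192 -> context_length_exceeded
```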

Two solutions:

  1. A non-robust but interim one: we set the experiment's token limit to be << the model's maximum, so the rest of the input components have some room to wiggle.
  2. The more robust one: we consider all inputs to the LLM when checking the token count, while still only ever compressing the history, as we do now (see the sketch below).
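
A rough sketch of option 2, assuming `tiktoken` for counting; the function and variable names (`fit_history`, `fixed_parts`, etc.) are hypothetical, and in practice the pruning step would go through the existing token-based compression rather than plain truncation:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str) -> int:
    return len(enc.encode(text))


def fit_history(
    history_messages: list[str],
    fixed_parts: list[str],
    model_context_limit: int,
) -> list[str]:
    """Trim only the history so that history + all other prompt components
    stay within the model's context window.

    `fixed_parts` is everything else that goes into the final input
    (system prompt, user input, any extra data) and is never compressed.
    """
    fixed_tokens = sum(count_tokens(part) for part in fixed_parts)
    history_budget = model_context_limit - fixed_tokens

    # Drop the oldest messages until the history fits its budget.
    # The real implementation would summarise/compress here instead.
    trimmed = list(history_messages)
    while trimmed and sum(count_tokens(m) for m in trimmed) > history_budget:
        trimmed.pop(0)
    return trimmed
```

The key difference from the current behaviour is that the budget handed to the compression step is `model_context_limit - fixed_tokens`, not the raw limit, so the final assembled input can never exceed the model's context length.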