wxwatcher2004 opened this issue 10 months ago
The complexity is in deciding what to send within that input content. Assume you have a fixed budget of 1000 tokens, but the message you are typing is 1200 tokens (excluding chat history). How should the app decide what to include and what to truncate?
It would just take the last 1000 tokens. In reality all the models have a 16k or larger token limit, so there is enough room for the message itself; the real issue is the chat history, since the previous chat is usually all that is needed. If you wanted, you could specify how many previous chats to send, but I think that would be more work than letting more advanced users set the input token limit.
Deciding what to omit is key. For instance, the system prompt is very important and should not be the first thing to be cut.
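For illustration, a minimal sketch of that strategy could look like the following: always keep the system prompt, then fill the remaining budget with the most recent messages so the oldest turns are dropped first. Names like `trimToBudget` and `countTokens` are hypothetical, not existing big-AGI code, and the token counter is a crude approximation standing in for a real tokenizer.

```ts
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  text: string;
}

// Hypothetical token counter; a real implementation would wrap an actual
// tokenizer rather than approximating by word count.
const countTokens = (text: string): number =>
  Math.ceil(text.split(/\s+/).length * 1.3);

/**
 * Trim a conversation to fit an input-token budget.
 * The system prompt is always kept; the remaining budget is filled with the
 * most recent messages, so the oldest turns are dropped first.
 */
function trimToBudget(messages: ChatMessage[], inputTokenBudget: number): ChatMessage[] {
  const system = messages.filter(m => m.role === 'system');
  const rest = messages.filter(m => m.role !== 'system');

  let remaining = inputTokenBudget - system.reduce((sum, m) => sum + countTokens(m.text), 0);

  const kept: ChatMessage[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = countTokens(rest[i].text);
    if (cost > remaining) break; // budget exhausted: drop this and every older turn
    kept.unshift(rest[i]);
    remaining -= cost;
  }
  return [...system, ...kept];
}
```

Note that with a 1000-token budget and a 1200-token draft message, even this variant has to decide whether to clip the draft itself or refuse to send, which is exactly the hard case raised above.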
To my knowledge this problem (context stuffing given constraints) hasn't been solved satisfactorily by anyone yet.
The issue is that to select what to omit from the context you need some sort of intelligence: either human (a person picks which messages to exclude from the context) or machine (embeddings, or better, a smaller GPT network).
There is no simple answer to your request. I'm leaning towards empowering the user to manually choose what to exclude from the LLM input, and maybe having a button to suggest what to remove (but again, that requires intelligence).
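One way such a "suggest what to remove" button could work, sketched under assumptions: rank the older messages by embedding similarity to the current prompt and propose the least relevant ones for exclusion. `EmbedFn` and `suggestExclusions` are illustrative names; the embeddings backend is left as a parameter because the choice of provider is not decided here.

```ts
// Whatever embeddings backend the app already talks to: one vector per input text.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

/**
 * Rank history messages by relevance to the current prompt and return the
 * indices of the `count` least relevant ones as removal suggestions.
 */
async function suggestExclusions(
  history: string[],
  currentPrompt: string,
  count: number,
  embed: EmbedFn,
): Promise<number[]> {
  const [promptVec, ...historyVecs] = await embed([currentPrompt, ...history]);
  return historyVecs
    .map((vec, index) => ({ index, score: cosine(vec, promptVec) }))
    .sort((a, b) => a.score - b.score) // least similar first
    .slice(0, count)
    .map(item => item.index);
}
```

The user would still confirm or reject the suggestions, keeping the final decision manual as proposed above.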
I completely agree with the strategy of empowering users to control what goes into the input in these scenarios.
Some thoughts that occurred to me while reading this:
Regarding the Intelligence aspect, I believe this could either enhance or be enhanced by a Condenser: https://github.com/enricoros/big-AGI/issues/292
For a straightforward approach, the Condenser might be utilized on an on-demand or opt-in basis, triggered by a pre-set threshold or governed by rule-based logic (for example, "condense every 10 turns").
Additionally, the Condenser engine could be specifically designed for this context, offering capabilities for detection and pruning in addition to condensation. This would enable it to automatically identify moments during a conversation when condensation might be beneficial.
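To make the threshold/rule-based trigger concrete, here is a small sketch of what the decision logic could look like. The names (`CondensePolicy`, `shouldCondense`) and defaults are illustrative, not existing big-AGI or Condenser APIs.

```ts
interface CondensePolicy {
  everyNTurns?: number;    // e.g. 10: condense every 10 turns
  tokenThreshold?: number; // e.g. 8000: condense once the history exceeds this
}

/** Decide whether the Condenser should run, based on simple rule-based triggers. */
function shouldCondense(turnCount: number, historyTokens: number, policy: CondensePolicy): boolean {
  const byTurns = policy.everyNTurns !== undefined
    && turnCount > 0
    && turnCount % policy.everyNTurns === 0;
  const byTokens = policy.tokenThreshold !== undefined
    && historyTokens > policy.tokenThreshold;
  return byTurns || byTokens;
}

// Example: "condense every 10 turns, or sooner if the history passes 8k tokens".
const policy: CondensePolicy = { everyNTurns: 10, tokenThreshold: 8000 };
shouldCondense(20, 5000, policy); // true (turn rule)
shouldCondense(7, 9000, policy);  // true (token rule)
```

An opt-in version would simply expose the policy (or a "Condense now" action) in the UI instead of applying it automatically.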
Why: API usage is paid for by tokens and is stateless. Currently there is only control over the output size, and the whole conversation is sent as input for context. On a 32k-token model the costs increase fast when you only need it to remember the last chat, such as when using it for code development.
Concise description: When using API calls to Mistral, Google, or OpenAI I can control the output context but not the input. Checking my token usage as the conversation grows, there is an increase in cost per message because the input tokens grow toward the maximum token size (e.g. 28K input for a 4K output).
Requirements: Add another slider for input tokens, next to the output token slider, that determines how much context to send in the API call. It can default to the maximum for new users, but more advanced users can adjust it to reduce their API bills.
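A rough back-of-the-envelope calculation of the cost growth described above, using the 28K-input / 4K-output scenario from the description. The per-token prices are purely illustrative placeholders, not any provider's actual rates.

```ts
// Illustrative prices only (USD per 1K tokens); real rates vary by provider and model.
const PRICE_PER_1K_INPUT = 0.01;
const PRICE_PER_1K_OUTPUT = 0.03;

const messageCost = (inputTokens: number, outputTokens: number): number =>
  (inputTokens / 1000) * PRICE_PER_1K_INPUT + (outputTokens / 1000) * PRICE_PER_1K_OUTPUT;

// The scenario from the description: 28K of history sent as input for a 4K reply...
const fullHistory = messageCost(28_000, 4_000); // 0.28 + 0.12 = $0.40 per message
// ...versus capping the input at, say, 4K tokens with an input-token slider.
const cappedInput = messageCost(4_000, 4_000);  // 0.04 + 0.12 = $0.16 per message
```

Whatever the real rates, the input side dominates the per-message cost in long conversations, which is what the proposed input-token slider would cap.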