Open Anteemony opened 4 months ago
As the output sent to the LLM depends on (1) the prompt template, (2) user queries, and (3) chat histories, we can control it with a few strategies:
@Anteemony @OscarArroyoVega, is this one still open? I haven't seen the current implementation using max_token_limit.
Hello. No, it's not implemented yet.
This feature exists for the documents retrieved by the retriever.
If you take a look, you'll see its slider UI is commented out (TODO) in tabs/play.py.
The value collected there is meant to be passed to the format_docs function, where the set of retrieved documents is reduced until the total tokens fall under the user's preferred limit.
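A minimal sketch of that reduction step, assuming a format_docs signature and a whitespace-based token count that are illustrative only (a real implementation would use the model's tokenizer):

```python
# Hypothetical sketch: drop retrieved documents from the end until the
# combined text fits under the user's max_tokens_limit.

def count_tokens(text: str) -> int:
    # Placeholder tokenizer: one token per whitespace-separated word.
    # Swap in the model's real tokenizer for accurate counts.
    return len(text.split())

def format_docs(docs: list[str], max_tokens_limit: int) -> str:
    kept = list(docs)
    # Remove the last (typically least relevant) document until we fit the budget.
    while kept and count_tokens("\n\n".join(kept)) > max_tokens_limit:
        kept.pop()
    return "\n\n".join(kept)

docs = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(format_docs(docs, 5))  # keeps only the documents that fit in 5 tokens
```

Dropping whole documents from the tail, rather than cutting text mid-document, avoids handing the LLM a truncated fragment.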
Apologies, the initial issue description does not do justice to the objective. The entire input shouldn't simply be truncated from the front or the back, because that would lose important information.
Truncating the retrieved documents seems like the most effective method.
Also, the model's max token limit can come into play here as the maximum value of the slider.
That maximum should account for the tokens consumed by the system prompt, something along the lines of (max = model max tokens - prompt tokens).
This way the user can have a more accurate max token option that won’t lead to errors.
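The bound described above could be computed as follows; the function name, the whitespace token count, and the default minimum of 100 are illustrative assumptions, not code from the repo:

```python
# Hypothetical sketch of the slider bounds: the maximum selectable value is
# the model's context limit minus the tokens already used by the system prompt.

def count_tokens(text: str) -> int:
    # Placeholder tokenizer: one token per whitespace-separated word.
    return len(text.split())

def slider_bounds(model_max_tokens: int, system_prompt: str, minimum: int = 100):
    max_value = model_max_tokens - count_tokens(system_prompt)
    # Never let the maximum fall below the minimum the UI accepts.
    return minimum, max(max_value, minimum)

print(slider_bounds(4096, "You are a helpful assistant."))  # (100, 4091)
```

Clamping the maximum to the minimum keeps the slider valid even for short-context models or very long system prompts.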
This feature allows the user to adjust the max_tokens_limit sent to the LLM. This can be done with a slider or text input. It should have a minimum value it can accept, e.g. 100, and the maximum value should be the accounted LLM input limit. This will allow users to use powerful models while saving on input costs.