chronick opened 11 months ago
This is a must-have! We have these exact needs.
I was thinking of something more straightforward, based only on a QPM (Questions Per Minute) limit at the Danswer level. Given the max TPM available to us on Azure OpenAI, I know I can handle 3 QPM without hitting the rate limit (I'm "losing" some capacity, but I'm sure I can handle those 3 questions).
We could set up a queuing system and inform the end user that their question has been queued and will be processed in the next XX seconds. I'd rather lose some OpenAI capacity than hit its rate limit and show an error message to the end user, since that's a bad experience. IMO, it's better to let them know about the delay in processing their question.
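To make that concrete, here's a minimal sketch of the pacing idea, assuming a single-process asyncio server; the 3 QPM figure, the `call_llm` stand-in, and the user messaging are all illustrative:

```python
import asyncio
import time

SAFE_QPM = 3  # known-safe questions-per-minute for our Azure OpenAI TPM cap
_INTERVAL = 60 / SAFE_QPM  # minimum seconds between question starts
_gate = asyncio.Lock()
_last_start = 0.0


async def call_llm(question: str) -> str:
    # Stand-in for the real LLM call.
    await asyncio.sleep(1)
    return f"answer to: {question}"


async def ask_with_queue(question: str) -> str:
    global _last_start
    if _gate.locked():
        # Here we'd tell the end user their question is queued, with an ETA.
        print("Your question has been queued and will be processed shortly.")
    async with _gate:
        # Wait until enough time has passed since the previous question started.
        wait = _INTERVAL - (time.monotonic() - _last_start)
        if wait > 0:
            await asyncio.sleep(wait)
        _last_start = time.monotonic()
    # LLM calls can overlap; only their *starts* are spaced to stay under QPM.
    return await call_llm(question)
```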
Interesting idea! That's good to know.
I will be on holiday until mid-January, I'll check back here then to see if there is any more interest. If so, I'm happy to submit a PR or discuss further if we need to refine the idea or prefer a different approach.
Hey, this is a great addition, we'd love to have it in the project! Are you still open to doing it once you're back from holiday @chronick?
I think it would also be good to allow for a global limit; what are your thoughts?
Finally, a point on the implementation: it may be good to have a new table that tracks usage over windows of time. Or a more intuitive but slower version could be to attach a token count to every message, query a persona's messages from the last x amount of time, and sum them. There are possibly even better ways; let me know your thoughts!
Hi @yuhongsun96 sure, I'm happy to submit a PR for this.
I don't have a problem with a global limit conceptually, but I'm not sure how that would fit in with the rest of the global config. Perhaps it's coupled with the configuration for the LLM server? Or would it go in its own section?
I did consider the approach of tracking usage and summing all tokens used within a time window, but I was concerned about scope creep, since we would first have to figure out what we want out of general usage tracking.
My solution of having a single entry with a mutable counter prevents us from having to design usage tracking first. But if we want to go with budgets using the usage tracker (however it is built), I'm happy to have that conversation.
Ya, I think either a global or per-persona limit is perfectly fine; honestly, whichever suits your use case best is great. Users will have a need for both anyway.
I think the counter that has windows works fine as well. 👍 Excited to have this addition!
Also, just a heads up: we have been evaluating different embedding models. You might find better results with the following env variable settings:
```
DOCUMENT_ENCODER_MODEL=intfloat/e5-base-v2
DOC_EMBEDDING_DIM=768
NORMALIZE_EMBEDDINGS=True
ASYM_QUERY_PREFIX="query: "
ASYM_PASSAGE_PREFIX="passage: "
```
or you can use intfloat/e5-small-v2 with embedding dim of 384
Dropped in to point out that LiteLLM already has a pretty decent budget facility (that can work in currency, not just tokens) and can perhaps be used as a backend.
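For reference, a rough sketch of what that could look like with LiteLLM's BudgetManager, going by its docs; exact method signatures may differ across versions, so treat this as illustrative rather than a drop-in integration:

```python
from litellm import BudgetManager, completion

budget_manager = BudgetManager(project_name="danswer-budget-demo")
user = "persona-support-bot"  # illustrative: budget keyed per persona

if not budget_manager.is_valid_user(user):
    # Dollar-denominated budget, reset daily.
    budget_manager.create_budget(total_budget=10.0, user=user, duration="daily")

if budget_manager.get_current_cost(user=user) < budget_manager.get_total_budget(user):
    response = completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    # Record the spend from this completion against the user's budget.
    budget_manager.update_cost(completion_obj=response, user=user)
else:
    print("Budget exceeded for this persona; try again tomorrow.")
```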
@yuhongsun96 diving into this finally, it looks like there's an existing chat_message table I can query to accomplish this, which I didn't realize before. Getting the sum of all tokens used within a timeframe is pretty trivial with that, so I can just check that the total count within the window is less than the budget config.
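Roughly what I have in mind, as a sketch; the import path and the token_count/time_sent columns are how I read the existing schema, so double-check the details:

```python
from datetime import datetime, timedelta

from sqlalchemy import func, select
from sqlalchemy.orm import Session

# Path assumed; ChatMessage is the ORM model backing the chat_message table.
from danswer.db.models import ChatMessage


def tokens_used_in_window(db: Session, window_hours: int) -> int:
    # Sum the token counts of all messages sent within the rolling window.
    cutoff = datetime.utcnow() - timedelta(hours=window_hours)
    stmt = select(func.coalesce(func.sum(ChatMessage.token_count), 0)).where(
        ChatMessage.time_sent >= cutoff
    )
    return db.scalar(stmt)


def is_under_budget(db: Session, budget_tokens: int, window_hours: int) -> bool:
    return tokens_used_in_window(db, window_hours) < budget_tokens
```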
Thinking about this more, a global limit does seem like the better feature to implement first, but I'm not sure where to store that kind of config. On the UI side, adding it to the "LLM" tab makes sense to me, but I don't see any kind of global config. Is there any existing table I could use to store it globally?
Alternatively, I'm fine in my use case with having environment variables instead of attempting to store config in the db somewhere.
Thinking we'd have two env vars:
- `GLOBAL_TOKEN_BUDGET_TOKENS`: Number of tokens allowed within the budget window.
- `GLOBAL_TOKEN_BUDGET_WINDOW_LENGTH_HOURS`: Number of hours in the budget window.

This would probably cover 99% of use cases for folks.
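As a minimal sketch of how these two could be read and enforced (the helper name and defaults are mine):

```python
import os

# 0 (or unset) means "no budget configured" in this sketch.
BUDGET_TOKENS = int(os.environ.get("GLOBAL_TOKEN_BUDGET_TOKENS", "0"))
WINDOW_HOURS = int(os.environ.get("GLOBAL_TOKEN_BUDGET_WINDOW_LENGTH_HOURS", "24"))


def request_allowed(tokens_used_in_window: int) -> bool:
    # Allow everything when no budget is set; otherwise compare against the cap.
    if BUDGET_TOKENS <= 0:
        return True
    return tokens_used_in_window < BUDGET_TOKENS
```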
We have a key/value store which is currently just persisted to a file, but we're planning to port it over to Postgres at some point. The class is here: https://github.com/danswer-ai/danswer/blob/main/backend/danswer/dynamic_configs/file_system/store.py#L20
The usage is super easy, you can just check the code for instances of get_dynamic_config_store(), like here for example: https://github.com/danswer-ai/danswer/blob/main/backend/danswer/server/manage/administrative.py#L116
This may be a better replacement for the two env vars, and it can be hooked up to APIs so you can add the config to the LLM page.
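If I'm reading the linked code right, usage would look something like the sketch below; the key name is made up, and the import path and method names are my reading of the linked store, so verify against the code:

```python
# Import path assumed from the repo layout linked above.
from danswer.dynamic_configs import get_dynamic_config_store

TOKEN_BUDGET_KEY = "global_token_budget_tokens"  # hypothetical key name

# Set by an admin API handler:
get_dynamic_config_store().store(TOKEN_BUDGET_KEY, 500_000)

# Read when checking a request:
budget = get_dynamic_config_store().load(TOKEN_BUDGET_KEY)
```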
Hi, I've submitted a PR as a starting point. I've implemented it as middleware on the stream-answer-with-quote route; we can add it to other routes as needed. It currently returns a 429 error, but the frontend doesn't seem to handle that much. Is there a better response to return?
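For anyone following along, the middleware shape is roughly the sketch below; the route path, helper, and error body are illustrative rather than copied from the PR:

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

BUDGETED_PATHS = {"/chat/stream-answer-with-quote"}  # illustrative path


def is_under_budget() -> bool:
    # Placeholder for the real token-window check.
    return True


@app.middleware("http")
async def token_budget_middleware(request: Request, call_next):
    # Reject requests to budgeted routes once the window's budget is spent.
    if request.url.path in BUDGETED_PATHS and not is_under_budget():
        return JSONResponse(
            status_code=429,
            content={"detail": "Token budget exceeded; please try again later."},
        )
    return await call_next(request)
```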
@Weves @yuhongsun96 I've updated the PR as per your feedback. Please let me know if there's anything else that needs doing before we can merge it!
Summary
I would like to configure a persona to have a daily usage budget. I'm happy to contribute a pull request for this, I'm opening this issue to see if it is in line with the project vision.
Problem
I am piloting Danswer for possible use within my organization. One concern is that if this project gains traction, our OpenAI bill could easily grow out of control, so we are looking for ways to rate-limit usage. OpenAI gives us control over a monthly spending allowance, which is great, but we're hoping for more granular control so we don't exhaust the allowance in the first few days and have to choose between increasing it or waiting until next month to start using it again. A configuration that allows us to set a daily or hourly budget would solve this problem.
Proposal
Initially, our thinking was to configure it using a dollar amount, but that does not easily scale to different LLM systems that may use different currencies and billing APIs. Therefore, we believe that a token-based budget would be simpler and avoid any extra dependencies.
We could set up a global budget for the whole install, but we believe having a per-persona configuration would be more flexible since we could have different budgets for different models or use cases.
Acceptance Criteria
Ultimately, the method for controlling this would need to accomplish the following:
Mockup
Here is a rough mockup of what the persona configuration might look like.
Variant 1
This just has a "seconds" field.
Variant 2
This has a "value" field plus a "seconds/hours/days" dropdown, which might be more user-friendly. We might want to make the UI a bit more compact.
Methods
One method for accomplishing this would be, before every request, to look through the conversation history and total the tokens used in a given time period. Since there doesn't appear to be a conversation-history feature currently, we should probably look for a simpler method.
A simpler approach would be to save the following values in the persona configuration table:
- `token_budget_amount`: How many tokens to allow in this time period
- `token_budget_period`: How long the period lasts (in seconds)
- `tokens_used_period`: Tokens used this period
- `token_budget_next_timeout`: Next time the budget resets

This config could also be in a separate table or in a JSON column.
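As a sketch, the columns could look like this on the persona model (the model below is a stand-in; types and nullability are illustrative):

```python
from sqlalchemy import Column, DateTime, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Persona(Base):
    """Illustrative stand-in for the existing persona configuration table."""

    __tablename__ = "persona"

    id = Column(Integer, primary_key=True)
    token_budget_amount = Column(Integer, nullable=True)  # tokens allowed per period
    token_budget_period = Column(Integer, nullable=True)  # period length in seconds
    tokens_used_period = Column(Integer, default=0)  # tokens consumed so far
    token_budget_next_timeout = Column(DateTime, nullable=True)  # next reset time
```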
Before every request, we check the following (see the sketch after this list):
1. If the current time is past `token_budget_next_timeout`, reset: set `tokens_used_period = 0` and `token_budget_next_timeout += token_budget_period`.
2. If `tokens_used_period >= token_budget_amount`, deny the request with a message. Otherwise, make the request.
3. After the request, add the tokens consumed to `tokens_used_period`.
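A sketch of those three steps in Python, using the stand-in Persona model from above; error handling and session management are omitted:

```python
from datetime import datetime, timedelta


def budget_allows_request(persona: Persona, now: datetime) -> bool:
    if persona.token_budget_amount is None:
        return True  # no budget configured for this persona

    # Step 1: roll the window forward if we've passed the reset time.
    if (
        persona.token_budget_next_timeout is not None
        and now >= persona.token_budget_next_timeout
    ):
        persona.tokens_used_period = 0
        persona.token_budget_next_timeout += timedelta(
            seconds=persona.token_budget_period
        )

    # Step 2: deny if the budget for this period is already spent.
    return persona.tokens_used_period < persona.token_budget_amount


def record_usage(persona: Persona, tokens_used: int) -> None:
    # Step 3: after the request, add the tokens consumed to the counter.
    persona.tokens_used_period += tokens_used
```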
Caveats
Open Questions