chronick commented 11 months ago

Summary

I would like to configure a persona to have a daily usage budget. I'm happy to contribute a pull request for this, I'm opening this issue to see if it is in line with the project vision.

Problem

I am piloting Danswer to possibly be used within my organization. One issue we have is if this project gains traction, our OpenAI bill could easily grow out of control, so we are looking for ways to rate-limit usage. OpenAI gives us control over a monthly bill allowance which is great, but we're hoping to have more granular control so we don't run out of our bill in the first few days and have to choose between increasing the bill or waiting until next month to start using it again. A configuration that allows us to set a daily or hourly budget would solve this problem.

Proposal

Initially, our thinking was to configure it using a dollar amount, but that does not easily scale to different LLM systems that may use different currencies and billing APIs. Therefore, we believe that a token-based budget would be simpler and avoid any extra dependencies.

We could set up a global budget for the whole install, but we believe having a per-persona configuration would be more flexible since we could have different budgets for different models or use cases.

Acceptance Criteria

Ultimately, the method for controlling this would need to accomplish the following:

Before any request to the LLM, first check a running total for the configured time period.
If the number of tokens is greater than the allotted token budget, prevent any further requests until the timeout ends.
After a successful request, add the tokens used for that request to the running total.

Mockup

Here is a rough mockup of what the persona configuration might look like.

Variant 1

This just has a "seconds" field

Variant 2

This has a "value" field plus a "seconds, hours, days" dropdown which might be more user friendly. Might want to make the UI a bit more compact

Methods

One method for accomplishing this would be before every request, look through a conversation history to get the number of tokens used for a given time period. Since it doesn't appear that there is any conversation history feature currently, we should probably look for a simpler method to accomplish this.

A simpler approach would be to save the above values in the persona configuration table:

token_budget_amount: How many tokens to allow in this time period
token_budget_period: How long the period lasts (in seconds)
tokens_used_period: Tokens used this period
token_budget_next_timeout: Next time the budget resets

This config could also be in a separate table or in a JSON column.

Before every request, we check the following:

If current time is after token_budget_next_timeout, reset tokens_used_period = 0 and token_budget_next_timeout += token_budget_period
If tokens_used_period >=token_budget_amount`, deny request with a message. Otherwise, make the request.
If the request is successful, add the tokens used from the API response (if available) to tokens_used_period.

Caveats

Since we don't know beforehand how many tokens a request will use, if we are below the budget by any amount (even 1), we would permit the request, no matter how long. This means that we would be allowed to go over the budget every time. Not sure if there is a reasonable way to prevent this, so we may just need to allow it and document it.

Open Questions

Token usage pricing separates input tokens and output tokens. How should we handle it? Some options:
- Use just input tokens in the budget (not sure it makes sense to use just output tokens)
- Use input tokens + output tokens combined
- Use 2 fields, save both, and prevent requests if either of them are over.
Database fields: I've listed a few options above, assuming we want to move forward with this feature, what is our preference?
Should we allow multiple budgets? IMO its probably not necessary but might inform how we approach persistence (if so, a separate table would be ideal)
Should we allow a global budget instead of, or in addition to this?
Should we allow the user to specify an offset to accommodate different preferred timezones?

mboret commented 11 months ago

This is a must-have! We have these exact needs.

I was thinking of something more straightforward, only based on a QPM (Question Per Minute) at the Danswer level. Regarding our max TPM available on Azure OpenAI, I know I can handle 3 QPM without reaching the rate-limiting (I'm "losing" some capacity, but I'm sure to be capable of handling the 3 questions).

We could set up a queuing system and inform the end-user that the question has been queued and is being processed in the next XX seconds. I prefer losing OpenAI capacity to hitting its rate limit and generating an error message to the end user, as it's a bad user experience. IMO, it would be better to let him know about the delay in processing his question.

chronick commented 11 months ago

Interesting idea! Thats good to know.

I will be on holiday until mid-January, I'll check back here then to see if there is any more interest. If so, I'm happy to submit a PR or discuss further if we need to refine the idea or prefer a different approach.

yuhongsun96 commented 11 months ago

Hey, this is a great addition, we'd love to have it in the project! Are you still open to doing it once you're back from holiday @chronick?

I think it would also be good to allow for a global limit, what are your thoughts?

Finally a point on the implementation, it may be good to have a new table to track usage for sections of time. Or a more intuitive but less speedy version could be to attach token count to every message and do a query for messages of a persona within the last x amount of time and sum that. There's possibly even better ways, let me know your thoughts!

chronick commented 10 months ago

Hi @yuhongsun96 sure, I'm happy to submit a PR for this.

I don't have a problem with a global limit conceptually, but I'm not sure how that would fit in with the rest of the global config. Perhaps its coupled with the configuration for the LLM server? Or would it go in its own section?

I did consider the approach of tracking usage and summing all tokens used within a time window, but was concerned with scope creep, since we would have to figure out what we want out of general usage tracking:

Do we store a history of all queries?
What data should we store? Are there space or privacy concerns with storing the full Q/A and token usage?
How does query history work with the relevance filter, or any future "LLM-based pre-processing", since those would use multiple queries? Do we store intermediate results in the history? how?
Do we clean up data after some interval, or do we allow the table to grow continuously? How would we implement that if so?

My solution of having a single entry with a mutable counter prevents us from having to design usage tracking first. But if we want to go with budgets using the usage tracker (however it is built), I'm happy to have that conversation.

yuhongsun96 commented 10 months ago

Ya I think either a global or per persona limit is perfectly fine, honestly whichever suits your use case best is great. Users will have a need for both anyway.

I think the counter that has windows works fine as well. 👍 Excited to have this addition!

Also we have been evaluating different embedding models, just a heads up. You might find better results with the following env variable settings:

DOCUMENT_ENCODER_MODEL=intfloat/e5-base-v2
DOC_EMBEDDING_DIM=768
NORMALIZE_EMBEDDINGS=True
ASYM_QUERY_PREFIX="query: "
ASYM_PASSAGE_PREFIX="passage: "

or you can use intfloat/e5-small-v2 with embedding dim of 384

grugnog commented 10 months ago

Dropped in to point out that LiteLLM already has a pretty decent budget facility (that can work in currency, not just tokens) and can perhaps be used as a backend.

chronick commented 10 months ago

@yuhongsun96 diving into this finally, it looks like there's an existing chat_message table I can query to accomplish this, which I didn't realize before. Getting a sum of all tokens used within a timeframe is pretty trivial with that, so I can just check that the total count within the window is less than the budget config.

Thinking about this more, a global limit does seem like the better feature to implement first, but I'm not sure where to store that kind of config. On the UI side, adding it to the "LLM" tab makes sense to me, but I don't see any kind of global config. Is there any existing table I could use to store it globally?

Alternatively, I'm fine in my use case with having environment variables instead of attempting to store config in the db somewhere.

Thinking we'd have two env vars: GLOBAL_TOKEN_BUDGET_TOKENS: GLOBAL_TOKEN_BUDGET_WINDOW_LENGTH_HOURS: Number of hours in the budget window. This would probably cover 99% of use cases for folks.

yuhongsun96 commented 10 months ago

We have a key/value store which is just persisted in a file currently but we're planning to port it over to Postgres at some point. The class is here: https://github.com/danswer-ai/danswer/blob/main/backend/danswer/dynamic_configs/file_system/store.py#L20

The usage is super easy, you can just check the code for instances of get_dynamic_config_store(), like here for example: https://github.com/danswer-ai/danswer/blob/main/backend/danswer/server/manage/administrative.py#L116

This may be a better replacement for the two env vars and can be hooked up to APIs so you can add the config to the LLM page

chronick commented 9 months ago

Hi, I've submitted a PR as a starting point. I've implemented it as a middleware against the stream-answer-with-quote route. We can add it to other routes as needed. It currently returns a 429 error but that doesn't seem to be handled much by the frontend. Is there a better response to return?

chronick commented 8 months ago

@Weves @yuhongsun96 I've updated the PR as per your feedback. Please let me know if theres anything else that needs doing before we can merge it!

danswer-ai / danswer

Feature Idea: Token Budgets #830