AllYourBot / hostedgpt

An open version of ChatGPT you can host anywhere or run locally.

Add token cost tracking #402

Open krschacht opened 4 weeks ago

krschacht commented 4 weeks ago

I think a very first PR could consist of: internally track how many dollars every message and conversation has incurred, so that a user can keep a close eye on their total dollar spend this month.

High level:

Off the top of my head, here is how I think an implementation could go:

I doubt the price we track will be perfect, so we'll display it to the user as an estimated price. It looks like we may need to do some additional calculations for function calling. That should probably be a subsequent PR, but some notes I've collected:

matthewbennink commented 4 weeks ago

An estimated_input_token_count on the message seems useful, but if I understand correctly, we'll also need to add up the token counts of all prior messages, assuming all of them are sent.

For example, suppose we send 3 messages, each with 100 tokens, we get 3 replies, each with 100 tokens, and the system message is 200 tokens. Our first exchange will be 200+(100+100)=400 tokens, our second will be 200+(100+100)+(100+100)=600 tokens, and our third will be 200+(100+100)*3=800 tokens. Given the OpenAI pricing for GPT-4o of $5 per 1M input tokens and $15 per 1M output tokens, that's 1,500 input tokens and 300 output tokens across the three requests, or about $0.012. Is this your understanding?
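As a quick sanity check of that arithmetic, here's a toy sketch using only the example numbers above:

```ruby
# Toy check: per-request input grows because the history is re-sent each time.
system = 200
input_tokens  = (1..3).sum { |n| system + 100 * (2 * n - 1) }  # 300 + 500 + 700 => 1500
output_tokens = 3 * 100                                        # => 300
cost_dollars  = input_tokens * 5.0 / 1_000_000 + output_tokens * 15.0 / 1_000_000
# => 0.012, i.e. about 1.2 cents for the whole conversation
```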

As well, for very long conversations there will come a moment when some of the preceding messages are dropped or summarized to reduce token usage, or at least to fit within the context limit. In other words, the number of tokens in the latest message may not let us fully work out the cost of the request. It's almost as if we need to keep track of each individual API request and its number of (estimated) input and output tokens.

If the API can give us the actual number of input tokens, even better; that seems possible with OpenAI if we pass the include_usage streaming option. I don't see any mention of include_usage inside the OpenAI ruby gem repo (search query), but it might be something that can be surfaced there.

krschacht commented 4 weeks ago

@matthewbennink Yes, that's why I was thinking we update estimated_price twice. When we are generating a new message, we pass a newly created message (already persisted to the db) into get_next_message_job, which, in turn, passes it into ai_backend.

I think at the moment ai_backend is sending its request to the API, which includes all of the previous messages in the conversation, we can add up the tokens and save a preliminary estimated_price on the blank message.

Then when the response comes back to get_next_message_job, we do a final message.save and we can do one more token cost estimate and add it onto the estimated_price we previously calculated.
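A rough sketch of that two-pass flow (the method names, count_tokens, and the price columns here are all illustrative assumptions, not the app's actual internals):

```ruby
# Pass 1: before the API call, estimate input cost from the full history.
def estimate_before_request(message, conversation, model)
  input_tokens = conversation.messages.sum { |m| count_tokens(m.content) }
  message.update!(estimated_price: input_tokens * model.input_price_per_token)
end

# Pass 2: after the response, add the output cost onto the same column.
def finalize_after_response(message, model)
  output_tokens = count_tokens(message.content)
  message.update!(estimated_price:
    message.estimated_price + output_tokens * model.output_price_per_token)
end
```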

I didn't know about include_usage, that's cool! The OpenAI gem just passes the hash of params that we send straight on to OpenAI, so it should be supported.
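For example, something like this should work with the ruby-openai client (a sketch; with include_usage set, the final streamed chunk carries the usage object):

```ruby
client = OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])

client.chat(
  parameters: {
    model: "gpt-4o",
    messages: [{ role: "user", content: "Hello" }],
    stream: proc do |chunk, _bytesize|
      # With include_usage, the final chunk has an empty "choices" array and
      # a "usage" object with prompt_tokens and completion_tokens.
      if (usage = chunk["usage"])
        puts "input: #{usage["prompt_tokens"]}, output: #{usage["completion_tokens"]}"
      end
    end,
    stream_options: { include_usage: true }
  }
)
```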

matthewbennink commented 4 weeks ago

Do you think it's acceptable to store the estimated price as a float in the database? It'd be per message, so they'd all be very small values, and summing them may accumulate some rounding error. It might just average out in the end, and/or it might be fine for an estimate.

The alternative would be to store the token counts: perhaps store the input/prompt token count on the "user" messages and the output/completion token count on the "assistant" messages. (I'm not sure whether we'd need to represent "tool" messages differently. Are there other message roles I'm missing?) The monthly price estimate would then need to find all of those messages, sum the token counts by language model, and multiply each token count by its respective cost. That doesn't seem like it'd be particularly slow. E.g.,

```ruby
input_cost = Message.user.created_after(Date.current.beginning_of_month)
  .joins(assistant: :language_model)
  .sum("messages.token_count * language_models.input_cost_per_1m_tokens_in_millionths_of_cents")
output_cost = Message.assistant.created_after(Date.current.beginning_of_month)
  .joins(assistant: :language_model)
  .sum("messages.token_count * language_models.output_cost_per_1m_tokens_in_millionths_of_cents")
total_cost_in_cents = input_cost + output_cost
```

I'm sure I've gotten some of that wrong, but maybe the idea is there. I've never had to represent small prices before, so I'm struggling a bit there. I figure we want a unit where 1¢ per 1B tokens is the floor, and then an integer can represent the cost of X cents per 1B tokens at today's prices. So, $5 / 1M tokens would be represented as 500000, $.01 / 1M tokens as 1000, and $.00001 / 1M tokens as 1, which seems like a price point we'll never get to.

I also think it'd be perfectly reasonable to store the costs as floats per 1M or 1B tokens and just go from there. So, $5000 / 1B, $10 / 1B, and $.01 / 1B in the examples above.
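To make the integer option concrete, a toy sketch (the rates and constant names here are illustrative, not a proposed schema):

```ruby
# Prices stored as whole cents per 1B tokens; convert only for display.
INPUT_CENTS_PER_1B  = 500_000    # $5.00 per 1M tokens
OUTPUT_CENTS_PER_1B = 1_500_000  # $15.00 per 1M tokens

def cost_in_dollars(tokens, cents_per_1b)
  tokens * cents_per_1b / 1_000_000_000.0 / 100
end

cost_in_dollars(1_500, INPUT_CENTS_PER_1B)  # => 0.0075
```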

Given that it's just an estimate, it's perhaps worth keeping things simple. But I wanted to lay out the distinction between storing very small prices per message, like .00001 USD, and storing integer token values, such as 300.

Once we have a data type, I'd be happy to open up a PR to keep things moving.

krschacht commented 4 weeks ago

@matthewbennink hmm, my instinct is to just store the estimate. I think it should be fine to store it as a float. Is the concern you're raising that the estimate will somehow be worse if we store it as a float? I don't think I understand that. Or maybe what you're suggesting is that some rounding will inevitably occur by storing small floats which wouldn't occur if we stored tokens? I guess the key question is: what's the accuracy of floats in a Postgres table? I'm actually not sure. I can't think of a time I've had to store tiny fractions in a float. That may be worth a little bit of investigating.

I think that storing currency amounts rather than tokens will be a bit easier to deal with. It lets us do a really nice query like Message.user.created_after(...).sum(:estimate). It's not like summing up the tokens is a whole lot more complicated, but I don't think we otherwise have any need for token counts beyond estimates, so I think it's more straightforward to store estimates. Also, there may be multiple places we want to show estimates: if your cost was really high for the month, you might want to click into a detailed view and see cost per conversation (this is super low priority), or if it's a team account you may want to see cost by user. I think storing the column as the currency value makes it easier to do a whole range of different queries like this.

One small improvement: instead of storing a DOLLAR value, store a CENTS value. So maybe the column is named estimate_in_cents. By shaving off two decimal places we probably get a lot more accuracy, and it's easier for us humans to read 0.03 cents than 0.0003 dollars.
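A minimal migration sketch for that column (assuming a recent Rails; the roll-up query then follows the pattern already discussed):

```ruby
class AddEstimateInCentsToMessages < ActiveRecord::Migration[7.1]
  def change
    # Per-message cost estimate, in cents, covering input + output tokens.
    add_column :messages, :estimate_in_cents, :float
  end
end

# Monthly roll-up becomes a one-liner:
# Message.created_after(Date.current.beginning_of_month).sum(:estimate_in_cents)
```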

And I lean towards each message having a single estimate — and that estimate is the cost for generating that whole message (both the input and output tokens required to generate that message). That could also facilitate a future auto-truncation of history when the per-message cost rises above some cutoff.

I don't think you need to think about tool messages any differently than text messages except in one respect:

lumpidu commented 2 weeks ago

I don't fully understand why we need the cost estimate as a database column. It's a value derived from the token count and the current LLM price. If the LLM provider changes its pricing structure in the middle of a monthly period, one probably needs more than one simple magic number per LLM, but the token counts are the source of truth from which all costs can be derived.

I understand that a per-LLM overall token count (for input/output tokens) could be used as an optimization, so that one doesn't need to calculate all the tokens for a given period on the fly. And this number could be updated, e.g., via a background job after each LLM round-trip. OpenAI's own cost overview and detail pages are not exactly real-time either, so a slight delay between each LLM round-trip and this cached DB number should be acceptable.
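A minimal sketch of that cached-counter idea (the job and column names are hypothetical):

```ruby
class RollUpTokenUsageJob < ApplicationJob
  queue_as :default

  # Runs after each LLM round-trip; increments cached per-model counters.
  def perform(language_model_id, input_tokens, output_tokens)
    model = LanguageModel.find(language_model_id)
    model.increment!(:monthly_input_tokens, input_tokens)
    model.increment!(:monthly_output_tokens, output_tokens)
  end
end
```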

What would also interest me as a regular user of HostedGPT is not only the effective cost but also the token count itself. Having the possibility to toggle between them (by clicking on the numbers?) would be really nice. For non-English languages, the token count is often much higher.

krschacht commented 2 weeks ago

@lumpidu Yes, good point on both. We don’t need cost on messages and we could cache things on the LLM.

I agree on seeing token count. It could just be a simple paren like “$14.32 (13,729 tokens)”
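A tiny view-helper sketch for that format (the helper name is hypothetical; number_with_delimiter is the standard Rails helper):

```ruby
# Renders e.g. "$14.32 (13,729 tokens)"
def cost_with_tokens(cents, tokens)
  format("$%.2f (%s tokens)", cents / 100.0, number_with_delimiter(tokens))
end
```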

krschacht commented 2 days ago

Hi @lumpidu, I wanted to check in on this task and see if you had made any progress on it? And if not, let me know if you're still up for it.

lumpidu commented 1 day ago

@krschacht, I will probably dive into it later this week.