Implement Token Usage Tracking

ZachZimm commented 3 weeks ago

This PR resolves issue #8 by adding token usage information to the response messages. The response conforms to the schema I found here for non-streaming and here for streaming, at the very bottom. In this implementation, all streamed chunks include up-to-date information about token usage.

This streaming implementation differs slightly from openai's in that their api accepts a stream_options: dict parameter which can include "include_usage"=True, otherwise it will not stream back usage information. Additionally, they put forth a few rules about which chunks of the stream will contain usage data.

I did not add a count_tokens function as suggested in the issue, as it seems better to do this during generation since we are already using the tokenizer and in the case of non-streaming we already had an array of prompt tokens.

Blaizzy commented 3 weeks ago

Thank you very much for the great PR, @ZachZimm!

Blaizzy commented 3 weeks ago

This streaming implementation differs slightly from openai's in that their api accepts a stream_options: dict parameter which can include "include_usage"=True, otherwise it will not stream back usage information.

Additionally, they put forth a few rules about which chunks of the stream will contain usage data.

Can we implement this as well?

We need 1, to make it easier to define when we need it.

And 2 is a simple implementation, all you need to do is to pass the usage after streaming all tokens and just before yielding done (i.e., https://github.com/Blaizzy/fastmlx/blob/main/fastmlx/utils.py#L407).

Blaizzy commented 3 weeks ago

Please run pre-commit to fix styling check errors.

ZachZimm commented 2 weeks ago

I pushed a new commit based on your feedback. I did run pre-commit this time (sorry, I'd never heard of it before) and there shouldn't be any extra comments or prints.

Blaizzy commented 3 days ago

@ZachZimm please update the tests to support token info, this is what causing them to fail.

Blaizzy / fastmlx

Implement Token Usage Tracking #27