Closed ZachZimm closed 3 weeks ago
Thank you very much for the great PR, @ZachZimm!
- This streaming implementation differs slightly from openai's in that their api accepts a stream_options: dict parameter which can include "include_usage"=True, otherwise it will not stream back usage information.
- Additionally, they put forth a few rules about which chunks of the stream will contain usage data.
Can we implement this as well?
We need 1, to make it easier to define when we need it.
And 2 is a simple implementation, all you need to do is to pass the usage after streaming all tokens and just before yielding done (i.e., https://github.com/Blaizzy/fastmlx/blob/main/fastmlx/utils.py#L407).
Please run pre-commit to fix styling check errors.
I pushed a new commit based on your feedback. I did run pre-commit this time (sorry, I'd never heard of it before) and there shouldn't be any extra comments or prints.
@ZachZimm please update the tests to support token info, this is what causing them to fail.
This PR resolves issue #8 by adding token usage information to the response messages. The response conforms to the schema I found here for non-streaming and here for streaming, at the very bottom. In this implementation, all streamed chunks include up-to-date information about token usage.
This streaming implementation differs slightly from openai's in that their api accepts a
stream_options: dict
parameter which can include"include_usage"=True
, otherwise it will not stream back usage information. Additionally, they put forth a few rules about which chunks of the stream will contain usage data.I did not add a
count_tokens
function as suggested in the issue, as it seems better to do this during generation since we are already using the tokenizer and in the case of non-streaming we already had an array of prompt tokens.