Thank you for your sincere advice.
Thanks a lot, @Hime-Hina! Inspired by your demo, I integrated the latest tiktoken
library into my chatgpt-demo fork; you can view a live demo here. I reached a few conclusions about token counting.
In summary, the pseudo-formula can be written as:
$$
\begin{aligned}
\text{prompt tokens} &= \sum_{\texttt{msg}} \bigl( \texttt{encode(msg).length} + 4 \bigr) + 3 \\
\text{completion tokens} &= \texttt{encode(msg).length}
\end{aligned}
$$
Specifically, the 3 extra tokens for the conversation context are `<|im_start|>`, `assistant`, and `\n` (they prime the assistant's reply), and the 4 extra tokens for each message are `<|im_start|>`, the role/name, `\n`, and `<|im_end|>`.
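As a minimal sketch, the formula translates to something like the following; the `encode` function is assumed to come from a tiktoken encoder for the target model (cl100k_base for gpt-3.5-turbo), and the names are illustrative rather than my exact implementation:

```ts
// Sketch of the formula above. `encode` is assumed to be a tiktoken encoder's
// encode function for the target model (cl100k_base for gpt-3.5-turbo).
type Encode = (text: string) => Uint32Array | number[];

interface ChatMessage {
  role: string;
  content: string;
}

function countPromptTokens(messages: ChatMessage[], encode: Encode): number {
  // 3 tokens prime the reply: <|im_start|>, "assistant", "\n".
  let total = 3;
  for (const msg of messages) {
    // 4 extra tokens per message: <|im_start|>, role/name, "\n", <|im_end|>.
    total += encode(msg.content).length + 4;
  }
  return total;
}

function countCompletionTokens(completion: string, encode: Encode): number {
  // The completion usage appears to count only the content tokens.
  return encode(completion).length;
}
```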
I've compared the token count in the API response's header with the one I calculated myself, using Python and JavaScript respectively, and found no discrepancy. (Interestingly, I also found that the official tokenizer demo is actually a GPT-3 tokenizer, which encodes Chinese characters much less efficiently than gpt-3.5-turbo's.)
OpenAI also has a note on the markup language (ChatML) they created for conversations.
As you said, getting WASM to work on edge functions is incredibly tough; I spent almost half a day fighting bugs. In the end, I found that this approach works well on the self-host
route, which is similar to your solution in the dev
branch of your demo repo. But it doesn't work on Vercel or Netlify Edge Functions (serverless functions do work, but they can't stream responses). I finally solved it by using fetch to load the wasm file together with a dynamic import, roughly as sketched below.
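For reference, here is a minimal sketch of that pattern on an edge runtime, assuming @dqbd/tiktoken's lite build and that `tiktoken_bg.wasm` is served from a URL the function can fetch (e.g. copied into `public/`); the paths and handler shape are illustrative assumptions, not my exact code:

```ts
// Sketch: initialize @dqbd/tiktoken on an edge runtime by fetching the wasm
// binary at request time instead of bundling it.
import { init, Tiktoken } from "@dqbd/tiktoken/lite/init";
import cl100k_base from "@dqbd/tiktoken/encoders/cl100k_base.json";

export const config = { runtime: "edge" };

let wasmReady = false;

export default async function handler(req: Request): Promise<Response> {
  if (!wasmReady) {
    // Fetch the wasm binary (path is illustrative), compile it, and hand the
    // instance to tiktoken's init callback. Note that some edge runtimes
    // restrict compiling wasm from raw bytes, which is part of the hurdle
    // described above.
    const bytes = await fetch(new URL("/tiktoken_bg.wasm", req.url)).then((r) =>
      r.arrayBuffer()
    );
    const module = await WebAssembly.compile(bytes);
    await init((imports) => WebAssembly.instantiate(module, imports));
    wasmReady = true;
  }

  const { text } = (await req.json()) as { text: string };
  const encoder = new Tiktoken(
    cl100k_base.bpe_ranks,
    cl100k_base.special_tokens,
    cl100k_base.pat_str
  );
  const count = encoder.encode(text).length;
  encoder.free(); // release the wasm memory backing this encoder
  return new Response(JSON.stringify({ count }), {
    headers: { "content-type": "application/json" },
  });
}
```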
You can view my implementation through the following pages:
Clear and concise description of the problem
As the official cookbook How to stream completions notes:
Personally, I think it would be useful to implement that feature. Users wouldn't have to check the daily usage breakdown on their account page, and it would make for a more responsive and user-friendly experience.
Suggested solution
I have actually implemented that feature on the front-end already, using the @dqbd/tiktoken library, which is a third-party TypeScript version of the official tiktoken library. OpenAI also provides an example on how to count tokens with the tiktoken library. For specific implementation, please refer to my repo.
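For illustration, the core of the front-end counting comes down to a few lines, assuming @dqbd/tiktoken (the example string is a placeholder):

```ts
import { encoding_for_model } from "@dqbd/tiktoken";

// Get an encoder for gpt-3.5-turbo (cl100k_base) and count tokens for a string.
const enc = encoding_for_model("gpt-3.5-turbo");
const tokens = enc.encode("How many tokens does this sentence use?");
console.log(tokens.length); // token count for this content
enc.free(); // release the wasm-backed encoder when done
```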
Alternative
Maybe there is a way to implement it on the back-end by providing an API, but I have not succeeded in achieving that so far because it seems impossible to load a wasm file when deploying on Vercel. I have followed the tutorial on Vercel docs and tried some plugins to load the wasm file but failed. If anyone knows about this, please let me know! 😁
Additional context
I have not optimized my code, but it suffices for now. There are some bugs, as shown below:
The first completion is primed with `\n\n`, and 20 tokens are used. After some testing, I have observed that the completion's token count seems to equal the number of tokens of the completion content only, meaning the special tokens and line breaks are not included (please refer to the code for more details). The second completion has exactly the same content as the first one but is not primed with `\n\n`. Since `\n\n` is encoded as a single token (271), it accounts for one token, so the result is 19, which is exactly what we expected. But the paradox is that the daily usage page reports 19 for both. I have no idea why; it requires further testing.
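A quick way to check the `\n\n` part, assuming a cl100k_base encoder from @dqbd/tiktoken (this is only a sketch of the check, with a placeholder string, not my implementation):

```ts
import { get_encoding } from "@dqbd/tiktoken";

const enc = get_encoding("cl100k_base");

// Per my observation, "\n\n" encodes to a single token (id 271).
console.log(enc.encode("\n\n"));

// If the leading "\n\n" does not merge with the following text, the primed
// completion should count exactly one more token (20 vs 19 above).
const content = "Hello! How can I assist you today?";
console.log(enc.encode("\n\n" + content).length - enc.encode(content).length);

enc.free();
```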
If you know about this, please let me know! I would appreciate it.
In addition, my implementation is still quite rough and only supports the 'gpt-3.5' model; I have not tested it on other models. If you have any advice, please let me know as well.