Closed: Felipe-Amdocs closed this issue 9 months ago
I think this is the expected behavior based on OpenAI's original spec:
https://platform.openai.com/docs/api-reference/chat/streaming
https://cookbook.openai.com/examples/how_to_stream_completions
The final chunk will be `data: [DONE]`.
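For reference, here's a minimal sketch of consuming the stream and stopping on the `[DONE]` sentinel; the URL and model name are placeholders for a local OpenAI-compatible server:

```python
import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama-2",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line:
        continue  # SSE events are separated by blank lines
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":  # sentinel that terminates the stream
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
```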
Thanks @aniljava, checking the OpenAI documentation, this is indeed the case.
Is there any alternative way to calculate the token usage? I checked tiktoken, but it doesn't seem accurate with Llama 2. I was also thinking of creating an endpoint on the LLamaCPP server to do it for me after I get the full response. I don't plan to show it to the user, just to record it in the observability systems.
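One possible workaround, assuming the GGUF file is available locally (the model path below is a placeholder): count tokens with the model's own tokenizer through llama-cpp-python instead of tiktoken, since tiktoken's BPE vocabularies don't match Llama 2's tokenizer.

```python
from llama_cpp import Llama

# vocab_only=True loads just the tokenizer, not the model weights.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", vocab_only=True)

def count_tokens(text: str) -> int:
    # Llama.tokenize expects bytes and returns a list of token ids.
    return len(llm.tokenize(text.encode("utf-8")))

prompt_tokens = count_tokens("Hello, how are you?")
```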
@Felipe-Amdocs I'm open to adding a separate tokenize / detokenize endpoint if you write a PR for it.
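A hypothetical sketch of what such endpoints could look like; the route paths, request models, and the way the Llama instance is loaded are all assumptions here, not the actual server code:

```python
from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

app = FastAPI()
# Placeholder path; in the real server the model would come from its config.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", vocab_only=True)

class TokenizeRequest(BaseModel):
    text: str

class DetokenizeRequest(BaseModel):
    tokens: list[int]

@app.post("/extras/tokenize")
def tokenize(req: TokenizeRequest):
    tokens = llm.tokenize(req.text.encode("utf-8"))
    return {"tokens": tokens, "count": len(tokens)}

@app.post("/extras/detokenize")
def detokenize(req: DetokenizeRequest):
    text = llm.detokenize(req.tokens).decode("utf-8", errors="ignore")
    return {"text": text}
```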
Hi @abetlen, this is my personal account; I'm the same person as @Felipe-Amdocs.
I can work on the PR to provide the new endpoint. Can you give me access?
@felipelo that's awesome, if you'd like to contribute a PR you can fork the repo and open a pull request (or draft PR) here. Just make sure to enable "Maintainers can edit this PR" in the options so I can help you out with anything.
@Felipe-Amdocs I'll close this issue for now as the original issue is resolved. If you'd like to open a new one, or a PR for the tokenize endpoints, feel free. Cheers.
I would also appreciate getting usage stats in the response when streaming, even just in the final chunk would be nice
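In the meantime, a client-side approximation, assuming you already collect the parsed chunks (as in the SSE example above) and have a tokenizer-backed counter like the `count_tokens` helper sketched earlier:

```python
def usage_from_stream(prompt: str, chunks: list[dict], count_tokens) -> dict:
    """Approximate an OpenAI-style `usage` object for a streamed completion.

    `chunks` are the parsed SSE chunk dicts; `count_tokens` is a function
    mapping text to a token count (e.g. backed by the model's own tokenizer).
    """
    completion = "".join(
        c["choices"][0]["delta"].get("content", "") for c in chunks
    )
    # Note: tokenizing the raw prompt ignores chat-template tokens, so the
    # prompt count will undercount slightly compared to the server's own tally.
    prompt_tokens = count_tokens(prompt)
    completion_tokens = count_tokens(completion)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```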
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
When running the LLamaCPP server with streaming enabled, I don't get the `usage` field in the responses. However, when I set `stream` to false, I get it along with the LLM response.
Is that an intended limitation when streaming? If so, is there any other way to calculate it?
Current Behavior
The server does not return the `usage` field when `stream` is enabled.
Environment and Context
Running version 0.2.28.
Steps to Reproduce
With `stream` set to false, the response includes the `usage` object. However, when `stream` is true, I get the response token by token, but I expected either one extra JSON chunk containing `usage`, or the `usage` object returned alongside the chunk that carries the `finish_reason`.
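A minimal repro sketch (host, port, and model name are placeholders):

```python
import requests

body = {
    "model": "llama-2",
    "messages": [{"role": "user", "content": "Say hi"}],
    "stream": False,
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=body)
# With "stream": False the response body carries a usage object, e.g.
# {"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...}
print(r.json()["usage"])
# Flipping "stream" to True yields only delta chunks terminated by
# `data: [DONE]`, with no usage object anywhere in the stream.
```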