abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Llama CPP Server Not Returning 'usage' when 'stream' is Enabled #1082

Closed: Felipe-Amdocs closed this issue 9 months ago

Felipe-Amdocs commented 10 months ago

Expected Behavior

When running the LLamaCPP server with stream enabled, I don't get the 'usage' field in the responses. However, when I set stream to false, I do get it along with the LLM response.

Is this an intended limitation when streaming? If so, is there any other way to calculate it?

Current Behavior

The server does not return the 'usage' field when stream is enabled.

Environment and Context

Running llama-cpp-python 0.2.28.

Steps to Reproduce

curl --request POST \
  --url http://localhost:8000/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "stream": false,
    "messages": [
        {
            "content": "What is tallest building?",
            "role": "user"
        }
    ]
}'

Response:

{
    "id": "...",
    "object": "chat.completion",
    "created": 1704992296,
    "model": "...",
    "choices": [
        {
            "index": 0,
            "message": {
                "content": "The tallest......",
                "role": "assistant"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 145,
        "completion_tokens": 388,
        "total_tokens": 533
    }
}

However, when stream is true, I get the response token by token, but I expected either one extra JSON chunk containing 'usage', or the 'usage' field to be returned alongside the chunk that carries 'finish_reason'.
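
For reference, the streamed chunks look roughly like this (trimmed for brevity; exact fields may vary by version), and 'usage' never appears in any of them:

data: {"id": "...", "object": "chat.completion.chunk", "created": 1704992296, "model": "...", "choices": [{"index": 0, "delta": {"content": "The"}, "finish_reason": null}]}

data: {"id": "...", "object": "chat.completion.chunk", "created": 1704992296, "model": "...", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]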

aniljava commented 10 months ago

I think this is the expected behavior based on OpenAI's original spec.

https://platform.openai.com/docs/api-reference/chat/streaming
https://cookbook.openai.com/examples/how_to_stream_completions

The final chunk will just be data: [DONE].
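
A minimal sketch (using requests against the server from the report) of what consuming the stream looks like; there is no JSON after the [DONE] sentinel, so there is nothing to read a 'usage' field from:

import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "stream": True,
        "messages": [{"role": "user", "content": "What is the tallest building?"}],
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line:
        continue  # skip blank SSE separator lines
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":
        break  # final chunk: no JSON body, no 'usage'
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)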

Felipe-Amdocs commented 10 months ago

Thanks @aniljava, checking the OpenAI documentation, this is indeed the case.

Is there any alternative way to calculate the token usage? I checked tiktoken, but it does not seem to be accurate for Llama 2. I was also thinking of creating an endpoint on the LLamaCPP server to do it for me after I get the full response. I don't plan to show it to the user, just to keep it in the observability systems.
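
For now I'm considering something like the rough sketch below, counting tokens client-side with llama-cpp-python's own tokenizer (assuming the same GGUF file is reachable from the client; vocab_only and tokenize are the llama_cpp Python API as I understand it):

# Sketch: count prompt/completion tokens with the model's own tokenizer.
# Assumes the GGUF file used by the server is also available here.
from llama_cpp import Llama

tokenizer = Llama(model_path="path/to/model.gguf", vocab_only=True)  # load vocab only, no weights

prompt = "What is the tallest building?"
completion = "The tallest......"  # full text assembled from the streamed chunks

prompt_tokens = len(tokenizer.tokenize(prompt.encode("utf-8")))
completion_tokens = len(tokenizer.tokenize(completion.encode("utf-8"), add_bos=False))

print({
    "prompt_tokens": prompt_tokens,
    "completion_tokens": completion_tokens,
    "total_tokens": prompt_tokens + completion_tokens,
})

The prompt count will likely undercount a bit compared to the server's 'usage', since the chat template adds its own tokens, but it should be close enough for observability purposes.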

abetlen commented 10 months ago

@Felipe-Amdocs I'm open to adding a separate tokenize / detokenize endpoint if you write a PR for it.
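
For illustration only, a rough sketch of what such endpoints could look like in a standalone FastAPI app (the /extras/* paths, request models, and the way the model is loaded here are hypothetical, not the server's actual wiring):

# Hypothetical sketch only; the real server manages model settings differently.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llama = Llama(model_path="path/to/model.gguf", vocab_only=True)  # tokenizer-only load

class TokenizeRequest(BaseModel):
    input: str

class DetokenizeRequest(BaseModel):
    tokens: list[int]

@app.post("/extras/tokenize")
def tokenize(body: TokenizeRequest) -> dict:
    tokens = llama.tokenize(body.input.encode("utf-8"), add_bos=False)
    return {"tokens": tokens, "count": len(tokens)}

@app.post("/extras/detokenize")
def detokenize(body: DetokenizeRequest) -> dict:
    text = llama.detokenize(body.tokens).decode("utf-8", errors="ignore")
    return {"text": text}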

felipelo commented 10 months ago

Hi @abetlen, this is my personal account; same person as @Felipe-Amdocs.

I can work on the PR to provide the new endpoint. Can you give me access?

abetlen commented 9 months ago

@felipelo that's awesome! If you'd like to contribute a PR, you can fork the repo and open a pull request (or draft PR) here. Just make sure to enable "Maintainers can edit this PR" in the options so I can help you out with anything.

abetlen commented 9 months ago

@Felipe-Amdocs I'll close this issue for now since the original question is resolved. If you'd like to open a new issue or a PR for the tokenize endpoints, feel free. Cheers.

JettScythe commented 8 months ago

I would also appreciate getting usage stats in the response when streaming; even having them just in the final chunk would be nice.