QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

How to count the number of tokens when deploying Qwen with vLLM? #444

Closed dongteng closed 5 months ago

dongteng commented 5 months ago

For OpenAI models you can use the tiktoken package. How can I count the number of tokens in the streaming output of a Qwen model?

llm2 = ChatOpenAI(
    model="/vllm/qwen1.5-chat-moe",
    openai_api_key="xxx",
    openai_api_base="http://localhost:8000/v1",
    streaming=True,
    tiktoken_model_name="/vllm/qwen1.5-chat-moe",
)
async for event in llm2.astream(input="你好"):
    print(event)

huangguifeng commented 5 months ago

I just came across an example of this in LLaMA-Factory:

import os
import time

from openai import OpenAI
from transformers.utils.versions import require_version

require_version("openai>=1.5.0", "To fix: pip install openai>=1.5.0")

def main():
    client = OpenAI(
        api_key="0",
        base_url="http://localhost:{}/v1".format(os.environ.get("API_PORT", 8000)),
    )
    messages = [{"role": "user", "content": "Write a long essay about environment protection as long as possible."}]
    num_tokens = 0
    start_time = time.time()
    for _ in range(8):
        result = client.chat.completions.create(messages=messages, model="test")
        num_tokens += result.usage.completion_tokens

    elapsed_time = time.time() - start_time
    print("Throughput: {:.2f} tokens/s".format(num_tokens / elapsed_time))
    # --infer_backend hf: 27.22 tokens/s (1.0x)
    # --infer_backend vllm: 73.03 tokens/s (2.7x)

if __name__ == "__main__":
    main()
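
The snippet above uses non-streaming requests, where result.usage is returned directly. Below is a minimal sketch of the streaming case, adapted from the same snippet. It assumes a recent openai SDK and a vLLM version that reports usage in the stream (the stream_options argument may be unnecessary, or unsupported, depending on your versions), so treat it as a starting point rather than a guaranteed recipe.

import os

from openai import OpenAI

def main():
    client = OpenAI(
        api_key="0",
        base_url="http://localhost:{}/v1".format(os.environ.get("API_PORT", 8000)),
    )
    messages = [{"role": "user", "content": "Write a long essay about environment protection as long as possible."}]
    stream = client.chat.completions.create(
        messages=messages,
        model="test",
        stream=True,
        # Assumption: the server honors stream_options; some vLLM versions
        # emit usage near the end of the stream even without it.
        stream_options={"include_usage": True},
    )
    completion_tokens = None
    for chunk in stream:
        # Content deltas arrive chunk by chunk; usage is attached to a late chunk.
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            completion_tokens = usage.completion_tokens
    print("completion_tokens:", completion_tokens)

if __name__ == "__main__":
    main()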
dongteng commented 5 months ago

> I just came across an example of this in LLaMA-Factory: (code quoted above)

The non-streaming response does include the token usage statistics: [screenshot]

My streaming output, however, looks like this: [screenshot]

jklj077 commented 5 months ago
  1. tiktoken is a package for tokenization. The same thing can be done by Qwen2Tokenizer in transformers. Refer to the README on how to initialize the tokenizer.
  2. Simply applying tokenization to your query and response will not produce the same results as vLLM, because chat templates are involved. Also refer to the README on how to apply the chat template.
    query, response = ...
    prompt_tokens = len(tokenizer.apply_chat_template([{"role": "user", "content": query}], add_generation_prompt=True, tokenize=True))
    completion_tokens = len(tokenizer(response).input_ids) + 1
  3. If you enable the stream option, vLLM returns the response token by token. The token usage is always sent in the second-to-last chunk (see the vLLM code).
  4. However, ChatOpenAI is from langchain, and it DOES NOT support this feature. The pull request has been merged but is not yet in a released version. Refer to the README on how to use the openai package, or wait for a new release of langchain.

All in all, you have been asking the wrong questions: this is not an issue with vLLM, Qwen, tiktoken, or the tokenizer used by Qwen.
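
For completeness, here is a minimal, self-contained sketch of points 1 and 2: counting tokens locally with the Qwen tokenizer and the chat template. It assumes the checkpoint directory served by vLLM (the path from the first comment) also contains the tokenizer files and that AutoTokenizer can load them; the query and response strings are placeholders.

from transformers import AutoTokenizer

# Assumption: the path served by vLLM also holds the tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("/vllm/qwen1.5-chat-moe")

query = "你好"
response = "你好！有什么可以帮你的吗？"

# Point 2: apply the chat template so the prompt count matches what the server feeds the model.
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
)
prompt_tokens = len(prompt_ids)

# The +1 accounts for the end-of-turn token appended after the generated text.
completion_tokens = len(tokenizer(response).input_ids) + 1

print("prompt_tokens:", prompt_tokens, "completion_tokens:", completion_tokens)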