Closed dongteng closed 5 months ago
I happened to come across an example of this in LLaMA-Factory:
import os
import time

from openai import OpenAI
from transformers.utils.versions import require_version

require_version("openai>=1.5.0", "To fix: pip install openai>=1.5.0")


def main():
    client = OpenAI(
        api_key="0",
        base_url="http://localhost:{}/v1".format(os.environ.get("API_PORT", 8000)),
    )
    messages = [{"role": "user", "content": "Write a long essay about environment protection as long as possible."}]
    num_tokens = 0
    start_time = time.time()
    for _ in range(8):
        result = client.chat.completions.create(messages=messages, model="test")
        num_tokens += result.usage.completion_tokens

    elapsed_time = time.time() - start_time
    print("Throughput: {:.2f} tokens/s".format(num_tokens / elapsed_time))
    # --infer_backend hf:   27.22 tokens/s (1.0x)
    # --infer_backend vllm: 73.03 tokens/s (2.7x)


if __name__ == "__main__":
    main()
The non-streaming response does come with token usage statistics, but my streaming output looks like this:
tiktoken is a package for tokenization. The same thing can be done by Qwen2Tokenizer in transformers. Refer to the README on how to initialize the tokenizer.

query, response = ...
prompt_tokens = len(tokenizer.apply_chat_template([{"role": "user", "content": query}], add_generation_prompt=True, tokenize=True))
completion_tokens = len(tokenizer(response).input_ids) + 1
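For illustration, here is a minimal sketch of that counting with the tokenizer initialization filled in. It assumes the tokenizer is loaded with AutoTokenizer; the checkpoint name and the response string below are placeholders, not taken from the issue:

from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the model you actually serve.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat")

query = "Write a long essay about environment protection as long as possible."
response = "..."  # replace with the text accumulated from the streamed chunks

# Token count of the templated prompt (tokenize=True returns a list of token ids).
prompt_tokens = len(
    tokenizer.apply_chat_template(
        [{"role": "user", "content": query}],
        add_generation_prompt=True,
        tokenize=True,
    )
)
# Token count of the generated text, plus one for the end-of-text token.
completion_tokens = len(tokenizer(response).input_ids) + 1
print(prompt_tokens, completion_tokens)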
With the stream option, the response is returned token by token in vLLM. The token usage is always sent via the second last chunk (the vllm code). ChatOpenAI is from langchain. It DOES NOT support this feature. The pull request has been merged but is not in a released version. Refer to the README on how to use the openai package or wait for a new release of langchain.

In all, you have just asked all the wrong questions. It is not related to vLLM or Qwen or tiktoken or the tokenizer used by Qwen.
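As a rough sketch of the openai-package route described above, the loop below streams a chat completion and picks up completion_tokens from whichever chunk carries a usage field. The endpoint, api_key, and model name are placeholders carried over from the earlier example, and getattr is used because older openai client versions may not expose usage on streamed chunks:

from openai import OpenAI

client = OpenAI(api_key="0", base_url="http://localhost:8000/v1")  # placeholder endpoint

stream = client.chat.completions.create(
    model="test",  # use the model name your server actually exposes
    messages=[{"role": "user", "content": "你好"}],
    stream=True,
)

completion_tokens = None
for chunk in stream:
    # Regular chunks carry incremental text in choices[0].delta.content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    # The chunk that reports usage has a non-None usage field.
    usage = getattr(chunk, "usage", None)
    if usage is not None:
        completion_tokens = usage.completion_tokens

print()
print("completion_tokens:", completion_tokens)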
OpenAI models can use the tiktoken package. How should I count the number of tokens in the streaming output of a Qwen model?
from langchain_openai import ChatOpenAI

llm2 = ChatOpenAI(
    model="/vllm/qwen1.5-chat-moe",
    openai_api_key="xxx",
    openai_api_base="http://localhost:8000/v1",
    streaming=True,
    tiktoken_model_name="/vllm/qwen1.5-chat-moe",
)

async for event in llm2.astream(input="你好"):
    print(event)