Open leeaction opened 1 month ago
Perhaps you need to add stop tokens:
```python
from transformers import AutoTokenizer
from vllm import SamplingParams

tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True)
stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]  # attributes of the MiniCPM-Llama3-V-2_5 custom tokenizer
sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
)
```
Is there an existing issue / discussion for this?
- [x] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?
- [x] I have searched FAQ
Current Behavior
Running MiniCPM-Llama3-V-2_5 with vLLM: after starting the backend service and calling the OpenAI chat API, the returned completion contains a large number of <|eot_id|> tokens:
......'FINISH'.<|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|>3<|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|>FINISH<|eot_id|><|eot_id|><|eot_id|>... (the <|eot_id|> token repeats until the output hits the token limit)
Expected Behavior

The response should not contain these useless tokens.
Steps To Reproduce
No response
Environment
- OS: Ubuntu 20.04
- Python: 3.10
- Transformers: 4.43.3
- PyTorch: 2.3.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
Anything else?
No response
Hi, may I ask how to use this model with vLLM? Is there any related code or tutorial? I haven't been able to find anything.
Thanks for the reply, but when I start vLLM like this: `CUDA_VISIBLE_DEVICES=2 python -m vllm.entrypoints.openai.api_server --model /home/nlp/xc/NLP/LLM/openLLM/MiniCPM-Llama3-V-2_5 --tensor-parallel-size=2 --port 8088 --trust-remote-code`, it reports an error saying the model is not supported. I installed via `pip install vllm`, version 0.5.3.post1. Have you run into this too?
vLLM hasn't published an official release with this model yet; you need to install it from source.
Roughly when is that expected? Really looking forward to it.
I ran into a similar problem: after using stop_token_ids, the eot tokens disappeared, but the model still generated a large number of spaces. It seems every inference only stops once it reaches the max token limit. How can this be solved? @whyiug
You can try this code (it covers 2.0, 2.5, and 2.6):
```python
from transformers import AutoTokenizer
from PIL import Image
from vllm import LLM, SamplingParams

MODEL_NAME = "openbmb/MiniCPM-V-2_6"
# Also available for previous models
# MODEL_NAME = "openbmb/MiniCPM-Llama3-V-2_5"
# MODEL_NAME = "HwwwH/MiniCPM-V-2"

image = Image.open("xxx.png").convert("RGB")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
llm = LLM(
    model=MODEL_NAME,
    trust_remote_code=True,
    gpu_memory_utilization=1,
    max_model_len=2048,
)

messages = [{
    "role": "user",
    # The number of `(<image>./</image>)` placeholders must match the number of images
    "content": "(<image>./</image>)" + "\nWhat is the content of this image?",
}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Single inference
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
        # For multiple images, pass a list whose length equals the number
        # of `(<image>./</image>)` placeholders in the prompt:
        # "image": [image, image]
    },
}
# Batch inference
# inputs = [{
#     "prompt": prompt,
#     "multi_modal_data": {"image": image},
# } for _ in range(2)]

# 2.6
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
# 2.0
# stop_token_ids = [tokenizer.eos_id]
# 2.5
# stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]

sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
    use_beam_search=True,
    temperature=0,
    best_of=3,
    max_tokens=64,
)
outputs = llm.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
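If generation still runs on, it can help to confirm whether the stop tokens ever matched. A minimal check against the `outputs` object from the script above (in vLLM, `CompletionOutput.finish_reason` should read `'stop'` when a stop token or string fired, and `'length'` when the `max_tokens` budget was exhausted):

```python
# Inspect why generation ended: 'stop' means a stop token/string matched,
# 'length' means the max_tokens budget ran out before any stop condition hit.
completion = outputs[0].outputs[0]
print(completion.finish_reason)
if completion.finish_reason == "length":
    print("Hit max_tokens; the configured stop_token_ids probably never matched.")
```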
vLLM just released 0.5.4 yesterday, and it includes MiniCPM-V.
It seems `vllm serve` doesn't support the quantized model weights downloaded from HF, though the original weights do work. But at inference time it looks like generation only stops once it reaches the max tokens value. Is there a good way to solve this?
Quantized weights may need to wait a little longer. With the original weights you can pass `stop_token_ids` in the request; here is a request example for 2.5:
```python
from openai import OpenAI

# Assumes the vLLM OpenAI-compatible server is running locally; adjust base_url/port to your setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="openbmb/MiniCPM-Llama3-V-2_5",
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the image token `<image>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign.jpg",
                },
            },
        ],
    }],
    extra_body={
        "stop_token_ids": [128009, 128001]  # <|eot_id|>, <|end_of_text|> for 2.5
    },
)
```
If you're using 2.6, the stop_token_ids should be [151645, 151643].
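Rather than hardcoding the IDs, you can also derive them from the tokenizer, mirroring the offline script above (a small sketch for the 2.6 tokenizer and its `<|im_end|>` / `<|endoftext|>` stop tokens):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-2_6", trust_remote_code=True)
# For 2.6 this should resolve to [151645, 151643]
stop_token_ids = tokenizer.convert_tokens_to_ids(['<|im_end|>', '<|endoftext|>'])
print(stop_token_ids)
```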
My request code is as follows. Right now the output is followed by a very large number of newline characters, and adding stop_token_ids doesn't really help, which is odd... My impression is that by default it has to generate the full max token budget before inference stops.
```python
chat_response = client.chat.completions.create(
    model="MiniCPM-Llama3-V-2_5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": text_prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    }],
    response_format={'type': 'json_object'},
    extra_body={'stop_token_ids': [128009, 128001], 'max_tokens': 512},
)
```
I tried it, and the problem you describe appears as soon as the `response_format` line is added; without it, the output is normal.
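Until that is fixed, one possible workaround (just a sketch, not an official recommendation) is to drop `response_format`, ask for JSON in the prompt instead, and parse the reply yourself; `client`, `text_prompt`, and `image_url` are the same as in the snippet above:

```python
import json

chat_response = client.chat.completions.create(
    model="MiniCPM-Llama3-V-2_5",
    messages=[{
        "role": "user",
        "content": [
            # Request JSON in the prompt instead of via response_format,
            # which appears to trigger the runaway-newline behaviour.
            {"type": "text", "text": text_prompt + "\nRespond with a single JSON object only."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    extra_body={'stop_token_ids': [128009, 128001], 'max_tokens': 512},
)
result = json.loads(chat_response.choices[0].message.content)
```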
I'm running into the same problem.