OpenBMB / MiniCPM-V

MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

[BUG] Running MiniCPM-Llama3-V-2_5 with vLLM: after starting the backend service, inference results returned via the OpenAI chat API contain large numbers of <|eot_id|> tokens #373

Open leeaction opened 1 month ago

leeaction commented 1 month ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

After running MiniCPM-Llama3-V-2_5 with vLLM and starting the backend service, inference results returned through the OpenAI chat API contain large numbers of <|eot_id|> tokens:

......'FINISH'.<|eot_id|><|eot_id|><|eot_id|><|eot_id|>3<|eot_id|><|eot_id|><|eot_id|>FINISH<|eot_id|><|eot_id|><|eot_id|>... (followed by a long run of repeated <|eot_id|> tokens)

Expected Behavior

Do not return these useless tokens.

Steps To Reproduce

No response

Environment

- OS: Ubuntu 20.04
- Python: 3.10
- Transformers: 4.43.3
- PyTorch: 2.3.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

No response

whyiug commented 1 month ago

Perhaps you need to add stop_token_ids:


stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
)
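For context, a minimal end-to-end sketch of how this is wired into vLLM's offline API for 2.5 (the model id, prompt, and generation settings here are illustrative assumptions, not from the original comment):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_NAME = "openbmb/MiniCPM-Llama3-V-2_5"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# The 2.5 tokenizer exposes eos_id and eot_id; passing both as stop tokens
# lets generation terminate instead of padding the output with <|eot_id|>.
stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
sampling_params = SamplingParams(stop_token_ids=stop_token_ids, max_tokens=256)

llm = LLM(model=MODEL_NAME, trust_remote_code=True, max_model_len=2048)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Describe vLLM in one sentence."}],
    tokenize=False, add_generation_prompt=True,
)
outputs = llm.generate(prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)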
1223243 commented 1 month ago

Hello, may I ask how to use this with vLLM? Is there any related code or a tutorial? I couldn't find any.

whyiug commented 1 month ago

@1223243 https://github.com/vllm-project/vllm/blob/cf2a1a4d9d8168d2e8e7bef244c1dfec80780405/examples/offline_inference_vision_language.py#L83C1-L84C1

1223243 commented 1 month ago

Thanks for your reply. I'm now starting vLLM this way: CUDA_VISIBLE_DEVICES=2 python -m vllm.entrypoints.openai.api_server --model /home/nlp/xc/NLP/LLM/openLLM/MiniCPM-Llama3-V-2_5 --tensor-parallel-size=2 --port 8088 --trust-remote-code. It reports an error saying this model is not supported. I installed vLLM via pip install vllm; the version is 0.5.3.post1. Have you run into this as well?

whyiug commented 1 month ago

vLLM hasn't published an official release tag with this support yet; you need to install it from source.

1223243 commented 1 month ago

When is that expected? Really looking forward to it.

another1s commented 1 month ago

I ran into a similar problem. After adding stop_token_ids, the eot tokens disappeared, but the output still contains a large number of spaces. It seems each inference only stops once the max token limit is reached. How can this be solved? @whyiug

HwwwwwwwH commented 1 month ago

You can try this code (it covers 2.0, 2.5, and 2.6):


from transformers import AutoTokenizer
from PIL import Image
from vllm import LLM, SamplingParams

MODEL_NAME = "openbmb/MiniCPM-V-2_6"
# Also available for previous models
# MODEL_NAME = "openbmb/MiniCPM-Llama3-V-2_5"
# MODEL_NAME = "HwwwH/MiniCPM-V-2"

image = Image.open("xxx.png").convert("RGB")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
llm = LLM(
    model=MODEL_NAME,
    trust_remote_code=True,
    gpu_memory_utilization=1,
    max_model_len=2048
)

messages = [{
    "role":
    "user",
    "content":
    # One "(<image>./</image>)" placeholder per input image in the prompt
    "(<image>./</image>)" + \
    "\nWhat is the content of this image?" 
}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Single Inference
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
        # Multi images, the number of images should be equal to that of `(<image>./</image>)`
        # "image": [image, image] 
    },
}
# Batch Inference
# inputs = [{
#     "prompt": prompt,
#     "multi_modal_data": {
#         "image": image
#     },
# } for _ in range(2)]

# 2.6
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
# 2.0
# stop_token_ids = [tokenizer.eos_id]
# 2.5
# stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]

sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids, 
    use_beam_search=True,
    temperature=0, 
    best_of=3,
    max_tokens=64
)

outputs = llm.generate(inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
HwwwwwwwH commented 1 month ago

vLLM released 0.5.4 yesterday, and that version includes MiniCPM-V.

another1s commented 1 month ago

vllm serve doesn't seem to support the quantized model weights downloaded from HF, though the original weights do work. However, inference seems to keep generating until the max_tokens value is reached before it stops. Is there a good way to fix this?

HwwwwwwwH commented 1 month ago

Support for the quantized weights may need to wait a bit longer. With the original weights, you can add stop_token_ids to the request; below is an example request for 2.5.

from openai import OpenAI

# Assumed setup: the vLLM OpenAI-compatible server started above, reachable on port 8088.
client = OpenAI(base_url="http://localhost:8088/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="openbmb/MiniCPM-Llama3-V-2_5",
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the image token `<image>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://air-example-data-2.s3.us-west-2.amazonaws.com/vllm_opensource_llava/stop_sign.jpg",
                },
            },
        ],
    }],
    extra_body={
        "stop_token_ids": [128009, 128001]
    }
)

If you're using 2.6, the stop_token_ids should be [151645, 151643].
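As a quick sanity check (a sketch assuming the 2.6 model is available locally; not part of the original reply), these IDs can be derived from the tokenizer instead of being hard-coded:

from transformers import AutoTokenizer

# Assumed model id; MiniCPM-V 2.6 uses <|im_end|> and <|endoftext|> as its stop tokens.
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-2_6", trust_remote_code=True)
stop_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in ("<|im_end|>", "<|endoftext|>")]
print(stop_token_ids)  # expected: [151645, 151643]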

another1s commented 1 month ago

My request code is below. Right now the output is followed by a huge number of newline characters, and even adding stop_token_ids doesn't help, which is strange. My impression is that by default it keeps generating until the max token count is reached.

chat_response = client.chat.completions.create(
    model="MiniCPM-Llama3-V-2_5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": text_prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    }],
    response_format={'type': 'json_object'},
    extra_body={'stop_token_ids': [128009, 128001], 'max_tokens': 512}
)
HwwwwwwwH commented 1 month ago

I tried it: as soon as the response_format line is added, the same problem appears; without it, the output is normal.
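If JSON output is still needed, one possible workaround (a sketch, not something suggested in this thread) is to drop response_format, keep the stop_token_ids, and extract the JSON from the returned text on the client side:

import json

# Assumes `client`, `text_prompt`, and `image_url` are set up as in the snippet above.
chat_response = client.chat.completions.create(
    model="MiniCPM-Llama3-V-2_5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": text_prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    # No response_format here, since guided JSON decoding appears to trigger the padding issue.
    extra_body={"stop_token_ids": [128009, 128001], "max_tokens": 512},
)

raw = chat_response.choices[0].message.content
# Best-effort extraction of the first JSON object embedded in the text.
start, end = raw.find("{"), raw.rfind("}")
parsed = json.loads(raw[start:end + 1]) if start != -1 and end != -1 else None
print(parsed)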

panxnan commented 1 month ago

I'm running into the same problem.