QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Apache License 2.0

The Qwen2-VL visual grounding code has a bug #289

Open xiangxinhello opened 1 month ago

xiangxinhello commented 1 month ago
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration

model_path = "/workspace/mnt/storage/infer_tensor/Qwen2-VL-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    device_map="auto",
)

messages = [
    [
        {
            "role": "user",
            "content": [
                # {
                #     "type": "image",
                #     "image": "/workspace/mnt/storage/xiangxin@supremind.com/infer_tensor/demo.jpeg",
                # },
                # {
                #     "type": "text",
                #     "text": "框出图中人脸的位置:",
                # },
                {
                    "type": "text",
                    "text": "你好",
                },
            ],
        },
    ],
]

# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=text, images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

# Output
output_text = tokenizer.decode(generated_ids_trimmed[0], skip_special_tokens=False, clean_up_tokenization_spaces=False)
output_tokens = tokenizer.convert_ids_to_tokens(generated_ids_trimmed[0])
print(output_text)
print(output_tokens)

If I send a text-only input with no image, the output looks wrong:

output_text = 你好!有什么我可以帮助你的吗?<|im_end|>
output_tokens = ['ä½łå¥½', 'ï¼ģ', 'æľīä»Ģä¹Ī', 'æĪijåı¯ä»¥', '帮åĬ©', 'ä½łçļĦ', 'åIJĹ', 'ï¼Ł', '<|im_end|>']

kq-chen commented 1 month ago

By "the output has a problem", do you mean that output_tokens looks like garbled text? That comes from Qwen's tokenization scheme; it is not an error. You can check it like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

token_ids = [108386,   6313, 104139, 109944, 100364, 103929, 101037,  11319]
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(tokens)  # ['ä½łå¥½', 'ï¼ģ', 'æľīä»Ģä¹Ī', 'æĪijåı¯ä»¥', '帮åĬ©', 'ä½łçļĦ', 'åIJĹ', 'ï¼Ł']
token_strs = [tokenizer.convert_tokens_to_string([token]) for token in tokens]
print(token_strs)  # ['你好', '!', '有什么', '我可以', '帮助', '你的', '吗', '?']
print(tokenizer.decode(token_ids))  # 你好!有什么我可以帮助你的吗?
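The "garbled" strings come from GPT-2-style byte-level BPE, where every UTF-8 byte is mapped to a printable stand-in character before tokens are stored in the vocabulary. A minimal stdlib-only sketch, assuming Qwen2's vocabulary uses the standard GPT-2 byte table:

```python
def bytes_to_unicode():
    """GPT-2 byte table: map each byte 0-255 to a printable unicode character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:  # non-printable bytes get fresh codepoints from 256 up
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()

# The UTF-8 bytes of "你好" (e4 bd a0 e5 a5 bd) render as the token string
# seen in output_tokens above.
token = "".join(byte_encoder[b] for b in "你好".encode("utf-8"))
print(token)  # ä½łå¥½
```

So convert_ids_to_tokens shows these per-byte stand-ins, while tokenizer.decode (and convert_tokens_to_string) first maps them back to real bytes and then UTF-8-decodes.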
Embracex1998 commented 1 month ago

> (quoting kq-chen's reply above)

Hello, this is a lifesaver! How should this problem be handled when it comes up with vLLM?
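The same byte table applies regardless of the serving framework: raw per-token strings from vLLM will look just as garbled, but joining them and mapping each character back through the inverse table recovers the readable text (which is what the tokenizer's decode does internally). A stdlib-only sketch, again assuming the GPT-2-style byte mapping; the tokens are the first two from the output above:

```python
def bytes_to_unicode():
    """GPT-2 byte table: map each byte 0-255 to a printable unicode character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# Invert the table: printable stand-in character -> original byte.
byte_decoder = {c: b for b, c in bytes_to_unicode().items()}

tokens = ["ä½łå¥½", "ï¼ģ"]  # first two tokens from the thread above
text = bytearray(byte_decoder[c] for c in "".join(tokens)).decode("utf-8")
print(text)  # 你好！
```

In practice it is simpler to use the decoded text the framework already returns (or pass the token ids through tokenizer.decode) rather than inspecting raw token strings.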