Yuliang-Liu / Monkey

【CVPR 2024 Highlight】Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Inference results don't match expectations: I want the model to output the text in the image, but the result is far from what the paper reports. Is there a problem with my code? Please take a look, thanks 🙏 #107

Closed xy1565838851 closed 2 months ago

xy1565838851 commented 3 months ago

```python
from monkey_model.modeling_textmonkey import TextMonkeyLMHeadModel
from monkey_model.tokenization_qwen import QWenTokenizer
from monkey_model.configuration_monkey import MonkeyConfig

if __name__ == "__main__":
    checkpoint_path = "/nas_works/408972/LLM/Monkey/Monkey-Chat"
    input_image = "0.jpg"
    input_str = "Read all the text in the image."
    device_map = "cuda"

    # Create model
    config = MonkeyConfig.from_pretrained(
        checkpoint_path,
        trust_remote_code=True,
    )
    model = TextMonkeyLMHeadModel.from_pretrained(checkpoint_path,
                                                  config=config,
                                                  device_map=device_map,
                                                  trust_remote_code=True).eval()
    tokenizer = QWenTokenizer.from_pretrained(checkpoint_path,
                                              trust_remote_code=True)
    tokenizer.padding_side = 'left'
    tokenizer.pad_token_id = tokenizer.eod_id
    tokenizer.IMG_TOKEN_SPAN = 1024
    # tokenizer.IMG_TOKEN_SPAN = config.visual["n_queries"]

    input_str = f"<img>{input_image}</img> {input_str}"
    input_ids = tokenizer(input_str, return_tensors='pt', padding='longest')

    attention_mask = input_ids.attention_mask
    input_ids = input_ids.input_ids

    pred = model.generate(
        input_ids=input_ids.cuda(),
        attention_mask=attention_mask.cuda(),
        do_sample=True,
        num_beams=1,
        max_new_tokens=32768,
        # max_new_tokens=2048,
        min_new_tokens=1024,
        length_penalty=1,
        num_return_sequences=1,
        output_hidden_states=True,
        use_cache=True,
        pad_token_id=tokenizer.eod_id,
        eos_token_id=tokenizer.eod_id,
    )
    response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=False).strip()
    print(f"Response:{response}")
```

[two screenshots attached showing the input image and the model's output]

echo840 commented 3 months ago

Hello, is this one of the examples we found while testing Monkey? The model you are running inference with appears to be TextMonkey; please test with the TextMonkey demo code we provide.

xy1565838851 commented 3 months ago

> Hello, is this one of the examples we found while testing Monkey? The model you are running inference with appears to be TextMonkey; please test with the TextMonkey demo code we provide.

Hello, I tested with your demo code as a reference. When the expected answer is short, the output is correct; following your example it does return "third floor". But if I ask it to output all the text in the whole image, the result is not as expected. Is it unable to produce long outputs?

xy1565838851 commented 3 months ago

> Hello, is this one of the examples we found while testing Monkey? The model you are running inference with appears to be TextMonkey; please test with the TextMonkey demo code we provide.

[screenshot of the generation parameters attached] Hello, how should these parameters be set to get a better result?

echo840 commented 2 months ago

[screenshot attached] These are the settings we use.
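
The maintainer's exact settings are only visible in the attached screenshot, which is unreadable in this transcript. As a point of reference only, below is a minimal sketch of a more conservative decoding configuration for this kind of full-image OCR prompt: greedy decoding instead of sampling, no forced minimum length, and a moderate `max_new_tokens`. The specific values (e.g. `max_new_tokens=2048`) are illustrative assumptions, not the settings from the screenshot; `model`, `tokenizer`, `input_ids`, and `attention_mask` are assumed to be built exactly as in the script at the top of this issue.

```python
# Hedged sketch only -- not the maintainer's confirmed settings.
# Assumes model/tokenizer/input_ids/attention_mask from the script above.
pred = model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,        # greedy decoding instead of sampling
    num_beams=1,
    max_new_tokens=2048,    # illustrative cap; no min_new_tokens forcing extra output
    length_penalty=1,
    num_return_sequences=1,
    output_hidden_states=True,
    use_cache=True,
    pad_token_id=tokenizer.eod_id,
    eos_token_id=tokenizer.eod_id,
)
response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(),
                            skip_special_tokens=False).strip()
```

Dropping `min_new_tokens=1024` and `do_sample=True` avoids forcing the model to keep sampling past its natural end-of-sequence token, which is a plausible cause of the degenerate long outputs described above.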