X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

Chat Template for Multi-Image Inference #248

Open tsunghan-wu opened 1 week ago

tsunghan-wu commented 1 week ago

Hi,

Thanks for the great work (mPLUG-Owl3)! I was wondering whether the template below is the correct chat format for multi-image inference, since the README doesn't mention it explicitly. With the following code the model does accept many images as input, but the results are below my expectations. Please let me know if my template is incorrect (specifically the real_prompt and the message formation).

Looking forward to hearing from you. Thanks!

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

huggingface_model_id = 'mPLUG/mPLUG-Owl3-7B-240728'
model = AutoModelForCausalLM.from_pretrained(
    huggingface_model_id,
    torch_dtype=torch.half,
    attn_implementation="flash_attention_2",
    trust_remote_code=True
).eval().to("cuda")
tokenizer = AutoTokenizer.from_pretrained(huggingface_model_id)
processor = model.init_processor(tokenizer)

# Given a bunch of image paths
image_paths = ['file1.png', 'file2.png', ...]

# Load every image as an RGB PIL image
images = []
for image_path in image_paths:
    images.append(Image.open(image_path).convert("RGB"))

# One <|image|> placeholder per image, all prepended to the text prompt
# (prompt holds the text question and is defined elsewhere)
real_prompt = '<|image|>' * len(image_paths) + prompt
messages = [{"role": "user", "content": real_prompt}, {"role": "assistant", "content": ""}]
inputs = processor(messages, images=images, video=None).to("cuda")

# Generate and decode the answer directly to text
generated_text = model.generate(
    **inputs, tokenizer=tokenizer, max_new_tokens=256, decode_text=True
)[0]
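
For reference, the single-image pattern I extrapolated from looks roughly like the snippet below (paraphrased from memory of the README, so treat the exact wording as approximate); my multi-image code above simply repeats the <|image|> placeholder once per image at the front of the prompt.

# Roughly the single-image formation I started from (approximate, not copied
# verbatim from the README); everything else is identical to the code above.
single_messages = [
    {"role": "user", "content": "<|image|>\nDescribe this image."},
    {"role": "assistant", "content": ""},
]
single_inputs = processor(single_messages, images=[images[0]], video=None).to("cuda")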