baaivision / Emu

Emu Series: Generative Multimodal Models from BAAI
https://baaivision.github.io/emu2/
Apache License 2.0

Multi-image experiments #3

Closed · vishaal27 closed this 1 year ago

vishaal27 commented 1 year ago

Hey,

I had a question regarding the specific prompt setting for the multi-image results in Figure 4 (left) of the paper. From a brief skim of your inference.py script and the Emu modeling code inside models/, I think the modifications needed to make this work would be something like the following:

    # Each image is represented by 32 <image> tokens wrapped in [IMG]...[/IMG].
    prompt = (
        "You will be presented with some images: [IMG]ImageContent[/IMG]. "
        "You will be able to see the images after I provide them to you. "
        "Please answer my questions based on the given images. "
        "[USER]: [IMG]<image><image><image><image><image><image><image><image><image><image>"
        "<image><image><image><image><image><image><image><image><image><image><image><image>"
        "<image><image><image><image><image><image><image><image><image><image>[/IMG]"
        "This is the first image."
        "[IMG]<image><image><image><image><image><image><image><image><image><image>"
        "<image><image><image><image><image><image><image><image><image><image><image><image>"
        "<image><image><image><image><image><image><image><image><image><image>[/IMG]"
        "This is the second image."
        "What's the difference?"
        "[ASSISTANT]:"
    )

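    # img must contain both images, stacked in the same order as the
    # two [IMG]...[/IMG] blocks above.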
    samples = {"image": img, "prompt": prompt}

    output_text = emu_model.generate(
        samples,
        max_new_tokens=512,
        num_beams=5,
        length_penalty=0.0,
        repetition_penalty=1.0,
    )[0].strip()

    print(f"===> caption output: {output_text}")
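For completeness, here is a minimal sketch of how I would assemble `img` for the two-image case. The transform parameters (image size, CLIP-style normalization) and the file paths are assumptions on my part and should be replaced with whatever inference.py actually uses:

    import torch
    from PIL import Image
    from torchvision import transforms

    # Assumed preprocessing; match the resize/normalization used in inference.py.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ])

    # Placeholder paths; stack both images along the batch dimension so they
    # align, in order, with the two [IMG]...[/IMG] blocks in the prompt.
    image_paths = ["first.jpg", "second.jpg"]
    img = torch.stack(
        [preprocess(Image.open(p).convert("RGB")) for p in image_paths],
        dim=0,
    )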

Could you confirm whether this is the prompt you used for the multi-image results in the paper, and let me know if it would work?

The other alternative would be to prepend the [USER] token before every new image-text sequence, as sketched below. However, since that would be out-of-distribution with respect to the instruction fine-tuning data format, I am not sure it would work.
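To make the alternative concrete, this is roughly what I mean (building each 32-token image placeholder programmatically; the variable names are mine):

    # One image placeholder = 32 <image> tokens wrapped in [IMG]...[/IMG].
    IMAGE_PLACEHOLDER = "[IMG]" + "<image>" * 32 + "[/IMG]"

    # Alternative format: open a new [USER] turn before each image. This
    # deviates from the single-turn instruction-tuning format, so it may be
    # out-of-distribution for the model.
    alt_prompt = (
        "[USER]: " + IMAGE_PLACEHOLDER + "This is the first image. "
        + "[USER]: " + IMAGE_PLACEHOLDER + "This is the second image. "
        + "What's the difference? [ASSISTANT]:"
    )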

yqy2001 commented 1 year ago

Hello! We have added support for interleaved image-text input for model inference. You can check out the updated code.

For interleaved image-text input, we do not add a system message, but you may try other system messages that make sense and see if they work.
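As an illustration only (not the exact format in the updated code), an interleaved prompt can be assembled along these lines, reusing the 32-token image placeholder convention from the snippet above; the captions here are placeholders:

    IMAGE_PLACEHOLDER = "[IMG]" + "<image>" * 32 + "[/IMG]"

    # Interleaved image-text input with no system message; the image tensors
    # must be supplied in the same order as the placeholders appear in the text.
    interleaved_prompt = (
        IMAGE_PLACEHOLDER + "A photo of an emu. "
        + IMAGE_PLACEHOLDER + "A photo of an ostrich. "
        + "What is the difference between the two birds?"
    )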

Thank you.