Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (an open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

[Feature Support] For Multi-Batch Data Inference Support #89

Open GuanlinLee opened 1 year ago

GuanlinLee commented 1 year ago

If you wish to generate descriptions for multiple images at one time, simply use the following code:

import requests
import torch
import transformers
from PIL import Image
from otter.modeling_otter import OtterForConditionalGeneration

model = OtterForConditionalGeneration.from_pretrained(
    "luodian/otter-9b-hf", device_map="auto"
)

tokenizer = model.text_tokenizer
image_processor = transformers.CLIPImageProcessor()
demo_image_one = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)
demo_image_two = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True
    ).raw
)
query_image = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028352.jpg", stream=True
    ).raw
)
vision_x = (
    image_processor.preprocess(
        [demo_image_one, demo_image_two, query_image], return_tensors="pt"
    )["pixel_values"]
    .unsqueeze(1)
    .unsqueeze(1)  # Here, we reshape the input images into shape [B, 1, 1, 3, H, W], where T_img=1 and F=1.
)
model.text_tokenizer.padding_side = "left"
lang_x = model.text_tokenizer(
    [
        "<image> User: what does the image describe? GPT: <answer>",
        "<image> User: what does the image describe? GPT: <answer>",
        "<image> User: what does the image describe? GPT: <answer>",
    ],  # Here, we provide one instruction per image.
    return_tensors="pt",
    padding=True,  # pad to a common length, since the instructions may differ in length
)
generated_text = model.generate(
    vision_x=vision_x.to(model.device),
    lang_x=lang_x["input_ids"].to(model.device),
    attention_mask=lang_x["attention_mask"].to(model.device),
    max_new_tokens=256,  # ~4 seconds; with max_new_tokens=512, ~7 seconds
    num_beams=1,
    no_repeat_ngram_size=3,
)
for i in range(vision_x.size(0)):
    print(f"Generated text for image {i}: ", model.text_tokenizer.decode(generated_text[i]))
GuanlinLee commented 1 year ago

After some tests, the maximum batch size is 20 on 4 RTX 3090s, but the total inference time increases as well. Specifically, generating the description for a single image on 4 RTX 3090s costs 8~9 seconds, while generating descriptions for 20 images at once costs about 155 seconds, i.e., 7.75 s/image.

The pipeline can be further improved.
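
For anyone reproducing these numbers, a minimal way to time a batched run (a sketch that reuses model, vision_x, and lang_x from the snippet above and assumes a CUDA setup):

import time

torch.cuda.synchronize()  # flush pending GPU work before starting the timer
start = time.time()
generated_text = model.generate(
    vision_x=vision_x.to(model.device),
    lang_x=lang_x["input_ids"].to(model.device),
    attention_mask=lang_x["attention_mask"].to(model.device),
    max_new_tokens=256,
    num_beams=1,
    no_repeat_ngram_size=3,
)
torch.cuda.synchronize()  # wait for generation to finish before stopping the timer
elapsed = time.time() - start
print(f"{elapsed:.1f} s total, {elapsed / vision_x.size(0):.2f} s per image")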

Luodian commented 1 year ago

As far as I can remember, generating one image's description takes 3-4 s/image, so it's quite weird that multi-image inference is that slow. My initial guess is that it's caused by Hugging Face's device_map mechanism: the model is sharded across different devices, and the data tensors are copied from device to device during inference. Maybe copying a 20-image tensor costs more time than copying just one image's tensor?

We are also working on improving training & inference efficiency. We now support xformers for the Otter model; you can check the main branch's latest update.
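
To see how the checkpoint was actually sharded (and therefore where cross-device tensor copies can happen), a quick check, assuming the model was loaded with device_map="auto" as in the snippet above:

# hf_device_map is filled in by accelerate when device_map="auto" is used;
# it maps each top-level module name to the GPU index (or "cpu"/"disk") it lives on.
for module_name, device in model.hf_device_map.items():
    print(module_name, "->", device)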

GuanlinLee commented 1 year ago

Yes, if max_new_tokens is 256, it costs about 4 seconds; if you use 512 tokens, the time cost roughly doubles. I forgot to point that out.

Enderfga commented 1 year ago

@Luodian @GuanlinLee May I know whether Otter supports simultaneous input of multiple images, for example, inputting two images at once and asking questions about that pair of images, as opposed to the multi-batch data inference discussed above?

ZhangYuanhan-AI commented 1 year ago

Yes.

The data can be built like this:

[B, N, T, C, H, W], with B=1, N=1, and T=2 in this scenario.

And we have a sub-task named spot_the_difference; this task asks questions about the difference between a pair of images.

May we know what task you want Otter to do? We can also include instruction data for the task you want!
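
For concreteness, a small sketch of how a pair of PIL images could be packed into that [B, N, T, C, H, W] layout (reusing image_processor and the demo images from the earlier snippet; names are illustrative):

pair = image_processor.preprocess(
    [demo_image_one, demo_image_two], return_tensors="pt"
)["pixel_values"]                          # [2, 3, H, W]
vision_x = pair.unsqueeze(0).unsqueeze(0)  # [B=1, N=1, T=2, 3, H, W]
print(vision_x.shape)                      # e.g. torch.Size([1, 1, 2, 3, 224, 224])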

Enderfga commented 1 year ago

> Yes.
>
> The data can be built like this:
>
> [B, N, T, C, H, W], with B=1, N=2, and T=1 in this scenario.
>
> And we have a sub-task named spot_the_difference; this task asks questions about the difference between a pair of images.
>
> May we know what task you want Otter to do? We can also include instruction data for the task you want!

May I ask if there are any demonstration codes available for this sub-task in this repository? I couldn't find them in pipeline/demo.

Luodian commented 1 year ago

We truly don't have a demo for spot_the_difference yet, but our image demo here supports multiple-image input.

Sorry, just a small revision: the data format should be [B, N, T, C, H, W], with B=1, N=1, and T=2 in this spot_the_difference scenario.

In the spot_the_difference scenario, we can treat the two images as two frames of a video and ask, "What is the difference between these two images?"

Feel free to ask more about it. Our Otter's unique advantage is supporting both in-context inputs and video inputs.
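
To make that distinction concrete, a small shape-only sketch (an illustration, not official API): in-context examples go along the N axis, while video frames go along the T axis.

pixels = image_processor.preprocess(
    [demo_image_one, demo_image_two], return_tensors="pt"
)["pixel_values"]                                # [2, 3, H, W]

# In-context style: two independent single-frame examples -> [B=1, N=2, T=1, 3, H, W]
in_context_x = pixels.unsqueeze(1).unsqueeze(0)

# Video style: the same two images as two frames of one example -> [B=1, N=1, T=2, 3, H, W]
video_x = pixels.unsqueeze(0).unsqueeze(0)

print(in_context_x.shape, video_x.shape)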

Enderfga commented 1 year ago
vision_x = image_processor.preprocess([demo_image_one, demo_image_two], return_tensors="pt")["pixel_values"].unsqueeze(1).unsqueeze(0)
model.text_tokenizer.padding_side = "left"
lang_x = model.text_tokenizer(
    [
        "<image><image>User: What is the difference between of these two images? GPT:<answer>"
    ],
    return_tensors="pt",
)

May I ask if it is like this? After I ran it, Otter did not give the expected response; it only replied with the single word "lighting."

ZhangYuanhan-AI commented 1 year ago

Like this:

vision_x = image_processor.preprocess([demo_image_one, demo_image_two], return_tensors="pt")["pixel_values"].unsqueeze(0).unsqueeze(0)
model.text_tokenizer.padding_side = "left"
lang_x = model.text_tokenizer(
    [
        "<image>User: What is the difference between of these two images? GPT:<answer>"
    ],
    return_tensors="pt",
)

However, the current Otter does not support the spot_the_difference task; we will upload such a model soon.

Luodian commented 1 year ago

Yes, you should listen to @ZhangYuanhan-AI Yuanhan's suggestion 😉