X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

Does mPLUG-Owl 2 support video training? #175

Closed. YuzhouPeng closed this issue 10 months ago.

YuzhouPeng commented 10 months ago

Does mPLUG-Owl 2 support video training?

MAGAer13 commented 10 months ago

You can decode the video into multiple images for training.
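For example, one straightforward way to get such images is to sample frames uniformly with OpenCV and convert them to PIL images. A minimal sketch; the video path and frame count are placeholders, and how the resulting frames are wired into the mPLUG-Owl training data is up to your dataset format:

```python
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR arrays; convert to RGB before wrapping in PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("example.mp4", num_frames=8)  # hypothetical path
```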

shaswati1 commented 10 months ago

> You can decode the video into multiple images for training.

Hi, I tried to use frames from a video as a sequence of images and ran inference on multiple images as below:

```python
image_tensor = process_images([image1, image2], image_processor)
query = "Summarize the images"

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        streamer=streamer,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )

outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
print(outputs)
```

Even though the first line in the code above gives me a tensor of shape [2, 3, 448, 448], the summary generated by the model focuses solely on the content of image1. Is this the right way to do it?

YuzhouPeng commented 10 months ago

> Even though the first line in the code above gives me a tensor of shape [2, 3, 448, 448], the summary generated by the model focuses solely on the content of image1. Is this the right way to do it?

I am also curious how to use an image sequence to understand an entire video. How should the context be built? @MAGAer13
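One possible reason the summary only covers image1 is that the prompt contains a single image placeholder, so only the first image is attended to. A hedged sketch of putting one placeholder per frame into the prompt, assuming the LLaVA-style helpers bundled with mPLUG-Owl2 (`DEFAULT_IMAGE_TOKEN`, `tokenizer_image_token`, `conv_templates`, `KeywordsStoppingCriteria`); whether the released checkpoints handle several image tokens well is not confirmed in this thread:

```python
import torch
from mplug_owl2.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from mplug_owl2.conversation import conv_templates
from mplug_owl2.mm_utils import process_images, tokenizer_image_token, KeywordsStoppingCriteria

# image1, image2, image_processor, model, and tokenizer are the same objects
# as in the snippet above.
frames = [image1, image2]
image_tensor = process_images(frames, image_processor).to(model.device, dtype=torch.float16)

conv = conv_templates["mplug_owl2"].copy()
# One image placeholder per frame, followed by the text query (assumed prompt layout).
inp = DEFAULT_IMAGE_TOKEN * len(frames) + "Summarize the video."
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX,
                                  return_tensors="pt").unsqueeze(0).to(model.device)
stopping_criteria = KeywordsStoppingCriteria([conv.sep2], tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=256,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip())
```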