X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

Does mPLUG-Owl 2 support video training? #175

Closed. YuzhouPeng closed this issue 10 months ago.

YuzhouPeng commented 10 months ago

Does mPLUG-Owl 2 support video training?

MAGAer13 commented 10 months ago

You can decode the video into multiple images for training.
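For example, one straightforward way to get such images is to sample frames uniformly with OpenCV and convert them to PIL images. A minimal sketch; the video path and frame count are placeholders, and how the resulting frames are wired into the mPLUG-Owl training data is up to your dataset format:

```python
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR arrays; convert to RGB before wrapping in PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("example.mp4", num_frames=8)  # hypothetical path
```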

shaswati1 commented 10 months ago

> You can decode the video into multiple images for training.

Hi, I tried to use frames from a video as a sequence of images and ran inference on multiple images as below:

```python
image_tensor = process_images([image1, image2], image_processor)
query = "Summarize the images"

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        streamer=streamer,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )

outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
print(outputs)
```

Even though the first line in the code above gives me a tensor of shape [2, 3, 448, 448], the summary generated by the model focuses solely on the content of image1. Is this the right way to do it?

YuzhouPeng commented 10 months ago

> Even though the first line in the code above gives me a tensor of shape [2, 3, 448, 448], the summary generated by the model focuses solely on the content of image1. Is this the right way to do it?

I am also curious how to use an image sequence to understand an entire video. How should the context be built? @MAGAer13
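One possible reason the summary only covers image1 is that the prompt contains a single image placeholder, so only the first image is attended to. A hedged sketch of putting one placeholder per frame into the prompt, assuming the LLaVA-style helpers bundled with mPLUG-Owl2 (`DEFAULT_IMAGE_TOKEN`, `tokenizer_image_token`, `conv_templates`, `KeywordsStoppingCriteria`); whether the released checkpoints handle several image tokens well is not confirmed in this thread:

```python
import torch
from mplug_owl2.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from mplug_owl2.conversation import conv_templates
from mplug_owl2.mm_utils import process_images, tokenizer_image_token, KeywordsStoppingCriteria

# image1, image2, image_processor, model, and tokenizer are the same objects
# as in the snippet above.
frames = [image1, image2]
image_tensor = process_images(frames, image_processor).to(model.device, dtype=torch.float16)

conv = conv_templates["mplug_owl2"].copy()
# One image placeholder per frame, followed by the text query (assumed prompt layout).
inp = DEFAULT_IMAGE_TOKEN * len(frames) + "Summarize the video."
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX,
                                  return_tensors="pt").unsqueeze(0).to(model.device)
stopping_criteria = KeywordsStoppingCriteria([conv.sep2], tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=256,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip())
```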