Closed minimini-1 closed 1 year ago
Hi, the model on the Hugging Face Space is an advanced version of mPLUG-Owl that natively supports video input via a temporal-related module, instead of treating a video as multiple independent frames. The video is tokenized into 65 tokens, the same as an image. We will release it very soon.
Ok, thanks for the answer!
Hello, thanks for your great work!! I have some questions about your model inference code. I want to run inference on video data, but the code only supports image data. However, the Hugging Face demo does accept video as input.
I randomly pick 8 frames from the video, save them to the image_list, and change the prompts like this.
Is this process the same as the Hugging Face demo code? If not, could you tell me the details of how video is processed in the Hugging Face demo?
Thank you.
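For reference, the frame-sampling step described above can be sketched as follows. This is a minimal, hypothetical sketch (the function name and segment-based strategy are my own, not taken from the mPLUG-Owl codebase): it splits the video into 8 equal segments and picks one frame index per segment, either deterministically (segment midpoints) or at random, which is a common way to implement "randomly pick 8 frames" while still covering the whole clip. Decoding the actual frames at those indices would be done with a video library such as OpenCV or decord.

```python
import random

def sample_frame_indices(total_frames, num_frames=8, seed=None):
    """Pick num_frames frame indices from a video of total_frames frames.

    With seed=None, return the midpoint of each of num_frames equal segments
    (deterministic). With a seed, draw one random index per segment, which
    mirrors 'randomly pick 8 frames' but keeps temporal coverage.
    """
    segment = total_frames / num_frames
    if seed is None:
        # Deterministic: midpoint of each segment.
        return [int((i + 0.5) * segment) for i in range(num_frames)]
    rng = random.Random(seed)
    # Random: one index drawn uniformly from each segment.
    return [
        rng.randrange(int(i * segment), max(int(i * segment) + 1, int((i + 1) * segment)))
        for i in range(num_frames)
    ]

# Example: an 80-frame clip, deterministic midpoints of 8 segments.
print(sample_frame_indices(80))  # [5, 15, 25, 35, 45, 55, 65, 75]
```

Whether this matches the demo depends on how the Space actually samples frames, which only the authors can confirm.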