Closed minimini-1 closed 1 year ago
Hi, the model on the Hugging Face Space is an advanced version of mPLUG-Owl that natively supports video input via a temporal-related module, instead of treating a video as multiple independent frames. The video is tokenized into 65 tokens, the same as an image. We will release it very soon.
Ok, thanks for the answer!
Hello, thanks for your great work!! I have some questions about your model inference code. I want to run inference on video data, but the code only supports image data. However, the Hugging Face demo does accept video as input.
I randomly pick 8 frames from the video, save them to the image_list, and change the prompts like this.
Is this process the same as the Hugging Face demo code? If not, could you tell me the details of how video is processed in the Hugging Face demo?
Thank you.
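For reference, the frame-sampling step described above can be sketched as follows. This is a minimal, hypothetical sketch (the function name and segment-based strategy are my own, not taken from the mPLUG-Owl codebase): it splits the video into 8 equal segments and picks one frame index per segment, either deterministically (segment midpoints) or at random, which is a common way to implement "randomly pick 8 frames" while still covering the whole clip. Decoding the actual frames at those indices would be done with a video library such as OpenCV or decord.

```python
import random

def sample_frame_indices(total_frames, num_frames=8, seed=None):
    """Pick num_frames frame indices from a video of total_frames frames.

    With seed=None, return the midpoint of each of num_frames equal segments
    (deterministic). With a seed, draw one random index per segment, which
    mirrors 'randomly pick 8 frames' but keeps temporal coverage.
    """
    segment = total_frames / num_frames
    if seed is None:
        # Deterministic: midpoint of each segment.
        return [int((i + 0.5) * segment) for i in range(num_frames)]
    rng = random.Random(seed)
    # Random: one index drawn uniformly from each segment.
    return [
        rng.randrange(int(i * segment), max(int(i * segment) + 1, int((i + 1) * segment)))
        for i in range(num_frames)
    ]

# Example: an 80-frame clip, deterministic midpoints of 8 segments.
print(sample_frame_indices(80))  # [5, 15, 25, 35, 45, 55, 65, 75]
```

Whether this matches the demo depends on how the Space actually samples frames, which only the authors can confirm.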