Closed funykatebird closed 1 year ago
We did not release methods about processing videos. One way to perform this task might be computing visual features with our ResNet adaptor for the images and using average pooling to make them one feature matrix as the input to the transformer. Also, you can try using more complex ways for the pooling.
BTW, we have done experiments on video with our new project OFASys. See if it can help. https://github.com/OFA-Sys/OFASys
I want to generate a video caption, but how can I use the image sequence? I could not find out which part to modify to use more than one image for the caption.