THUDM / CogVideo

Text-to-video generation. The repo for ICLR2023 paper "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers"
Apache License 2.0

About using pretrained image model's weight in video task #18

Closed lemon-prog123 closed 1 year ago

lemon-prog123 commented 1 year ago

Hi! I've read your paper, and it's really interesting work. I'm interested in the method you use to leverage pretrained weights from an image model, and I'd like to try it in my own task. However, your architecture seems designed for autoregressive tasks, whereas I want to apply it to a video classification task.

Could you give me some advice on a proper way to use an image model's pretrained weights in a transformer-based video task?

wenyihong commented 1 year ago

To use pretrained weights from an image model, we proposed dual-channel attention in our paper. It is a small modification to the transformer structure and can be applied in both autoregressive and non-autoregressive settings. You can try it directly in your classification task by freezing one of the channels with the weights of a pretrained image classification model.
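For illustration, a minimal PyTorch sketch of this dual-channel idea might look like the following. This is not the repo's actual implementation: the names (`DualChannelAttention`, `spatial_attn`, `temporal_attn`, `alpha`) and the sigmoid mixing scheme are assumptions made for the example, which only shows the general pattern of one frozen, image-pretrained attention channel mixed with a new trainable channel via a learnable weight.

```python
# Illustrative sketch only; names and mixing details are assumptions, not CogVideo's code.
import torch
import torch.nn as nn


class DualChannelAttention(nn.Module):
    """Mixes a frozen, image-pretrained attention channel with a new trainable one."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        # Channel 1: to be initialized from a pretrained image model and kept frozen.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Channel 2: newly added for the video task, trained from scratch.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable mixing weight; a negative init keeps the output close to the
        # pretrained channel at the start of training.
        self.alpha = nn.Parameter(torch.tensor(-3.0))

        # Freeze the image-pretrained channel, as suggested above.
        for p in self.spatial_attn.parameters():
            p.requires_grad = False

    def load_pretrained_spatial(self, state_dict):
        """Copy attention weights from a pretrained image model into the frozen channel."""
        self.spatial_attn.load_state_dict(state_dict)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, dim) token embeddings.
        spatial_out, _ = self.spatial_attn(x, x, x)
        temporal_out, _ = self.temporal_attn(x, x, x)
        gate = torch.sigmoid(self.alpha)
        # Weighted mixture of the frozen and trainable channels.
        return (1.0 - gate) * spatial_out + gate * temporal_out


if __name__ == "__main__":
    layer = DualChannelAttention(dim=64, num_heads=4)
    tokens = torch.randn(2, 16, 64)  # e.g. 16 patch/frame tokens per clip
    print(layer(tokens).shape)  # torch.Size([2, 16, 64])
```

For a classification setup, such a layer could replace the attention in each transformer block, with a pooled token fed to a classification head; the frozen channel preserves the image prior while the new channel learns the video-specific structure.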

lemon-prog123 commented 1 year ago

Thank you for your advice. I'll try it then.