YangLing0818 / VideoTetris

VideoTetris: Towards Compositional Text-To-Video Generation
https://arxiv.org/abs/2406.04277

Where is the RefAttn inserted in the pretrained T2V model? #2

Open Edwardmark opened 2 months ago

Edwardmark commented 2 months ago

It is clear that the Spatio-Temporal Compositional Diffusion is applied in the cross-attention between the input noise and the text prompt. But where is the RefAttn block inserted in the T2V model, and how is it used at inference? Could you show us a figure to illustrate it?

YangLing0818 commented 2 months ago

> It is clear that the Spatio-Temporal Compositional Diffusion is applied in the cross-attention between the input noise and the text prompt. But where is the RefAttn block inserted in the T2V model, and how is it used at inference? Could you show us a figure to illustrate it?

*[Attached figure: WechatIMG274]*

Thanks for your attention! The position of the RefAttn block is shown in the figure above. Please check.

Edwardmark commented 2 months ago

> Thanks for your attention! The position of the RefAttn block is shown in the figure above. Please check.

Thanks for your kind reply. Could you please explain what "Temp Attention" means in the figure? I cannot find it in the arXiv paper. Does it mean that the self-attention operates over the spatial axes, while the temp attention operates over the temporal axis?

onevfall commented 3 weeks ago

> Could you please explain what "Temp Attention" means in the figure? I cannot find it in the arXiv paper. Does it mean that the self-attention operates over the spatial axes, while the temp attention operates over the temporal axis?

The video input has shape `(b, t, h, w, c)`. For self-attention it is reshaped to `(bt, hw, c)`, so attention runs over the spatial tokens within each frame; for temp attention it is reshaped to `(bhw, t, c)`, so attention runs over the temporal axis at each spatial position.
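A minimal sketch of the two reshapes described above, using NumPy in place of the actual model tensors (the dimension sizes here are made up for illustration):

```python
import numpy as np

# Hypothetical sizes: batch, time, height, width, channels.
b, t, h, w, c = 2, 8, 16, 16, 64
video = np.random.randn(b, t, h, w, c)

# Spatial self-attention: fold time into the batch and flatten
# space into a token axis -> (b*t, h*w, c). Each frame's h*w
# patches attend to each other.
spatial_tokens = video.reshape(b * t, h * w, c)

# Temporal attention: fold space into the batch and keep time as
# the token axis -> (b*h*w, t, c). Each spatial position attends
# across the t frames.
temporal_tokens = video.transpose(0, 2, 3, 1, 4).reshape(b * h * w, t, c)

print(spatial_tokens.shape)   # (16, 256, 64)
print(temporal_tokens.shape)  # (512, 8, 64)
```

Note the `transpose` before the second reshape: the temporal reshape must move `t` next to `c` first, otherwise the flattening would scramble which frame each token comes from.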