Where exactly is the 3D full attention from the paper implemented?

THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)

Apache License 2.0


Open woshipapa opened 1 week ago

woshipapa commented 1 week ago

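I saw that the paper compares 3D full attention with 2D+1D attention and shows experimentally that 3D full attention generates better videos. The paper also mentions that it can be optimized with various parallelism strategies. Is there open-source code for this part?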

DidiD1 commented 1 week ago

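I think 3D full attention and 2D+1D can be understood as differing in how the input is patchified. 3D full attention patchifies all three dimensions directly, so one patch is 2×2×1 (h, w, t), whereas separated attention handles the temporal and spatial dimensions separately and keeps the information of each.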
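As a rough illustration of the patch size mentioned above, the sketch below computes how many tokens a latent video yields after patchification. It is not taken from the CogVideo code base: the helper name `num_tokens` and the latent shape are assumptions chosen only for the example.

```python
# Illustrative only: the helper name and shapes are assumptions, not CogVideoX code.
def num_tokens(frames, height, width, patch=(1, 2, 2)):
    """Sequence length after splitting a (frames, height, width) latent into patches.

    patch is given as (t, h, w); (1, 2, 2) corresponds to the 2x2x1 (h, w, t)
    patch described in the comment above.
    """
    pt, ph, pw = patch
    if frames % pt or height % ph or width % pw:
        raise ValueError("latent size must be divisible by the patch size")
    return (frames // pt) * (height // ph) * (width // pw)


# Hypothetical latent of 13 frames at 60x90.
seq_len = num_tokens(13, 60, 90)
print(seq_len)  # 13 * 30 * 45 = 17550
```

Whichever attention scheme is used, the tokens produced this way are the same; the schemes differ in which tokens are allowed to attend to each other.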

yzy-thu commented 1 week ago

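3D full attention means running attention over the entire sequence. Spatio-temporally separated attention means running attention either within a single frame (spatial) or across the same position in all frames (temporal).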
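The distinction can be made concrete by listing which token pairs may attend to each other in each scheme. The following is a minimal pure-Python sketch under assumed names (`token_grid`, `attention_mask`, `count_pairs`) and a tiny made-up grid; it only illustrates the attention patterns and is not the CogVideoX implementation.

```python
from itertools import product


def token_grid(frames, height, width):
    """All token coordinates (t, h, w), in flattened sequence order."""
    return list(product(range(frames), range(height), range(width)))


def attention_mask(tokens, mode):
    """mask[i][j] is True when query token i may attend to key token j."""
    def allowed(q, k):
        if mode == "full":      # 3D full attention: every token sees the whole sequence
            return True
        if mode == "spatial":   # only tokens in the same frame
            return q[0] == k[0]
        if mode == "temporal":  # only the same (h, w) position across frames
            return q[1:] == k[1:]
        raise ValueError(f"unknown mode: {mode}")
    return [[allowed(q, k) for k in tokens] for q in tokens]


def count_pairs(mask):
    """Number of (query, key) pairs that are allowed to interact."""
    return sum(sum(row) for row in mask)


tokens = token_grid(frames=3, height=2, width=2)  # 12 tokens in total
for mode in ("full", "spatial", "temporal"):
    print(mode, count_pairs(attention_mask(tokens, mode)))
# full 144, spatial 48, temporal 36
```

In this toy case full attention covers all 144 pairs, while the spatial and temporal patterns cover 48 and 36. With full attention no mask is needed at all: the tokens are flattened into one sequence and passed to an ordinary attention operation.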

woshipapa commented 5 days ago

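3D full attention means running attention over the entire sequence. Spatio-temporally separated attention means running attention either within a single frame (spatial) or across the same position in all frames (temporal).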

Is this the place, around padding_embedding?