a-r-r-o-w opened 1 day ago
Yes, this part (the learnable positional embedding you mentioned) can be removed; we have verified that the effect is almost the same.
Interesting conversations, here!
We could maybe also try something like NaViT, which would allow us to take the frames in their native aspect ratio.
The MovieGen paper did something similar too, I think. They used NaViT-style PEs but no RoPE.
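For context, the core NaViT idea ("patch n' pack") is to patchify frames of different aspect ratios without resizing them, then pack all the patches into one sequence with a mask marking real tokens. A minimal sketch of that packing step, with made-up shapes and helper names (this is not the MovieGen or NaViT implementation, just an illustration of the mechanism):

```python
import numpy as np

def patchify(frame, patch=2):
    """Split an (H, W, C) frame into non-overlapping (patch*patch*C,) tokens."""
    h, w, c = frame.shape
    assert h % patch == 0 and w % patch == 0
    tokens = frame.reshape(h // patch, patch, w // patch, patch, c)
    return tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def pack(frames, patch=2, pad_to=None):
    """Pack patches from variable-sized frames into one padded sequence + mask."""
    seqs = [patchify(f, patch) for f in frames]
    lengths = [s.shape[0] for s in seqs]
    total = sum(lengths)
    pad_to = pad_to or total
    packed = np.zeros((pad_to, seqs[0].shape[1]), dtype=seqs[0].dtype)
    mask = np.zeros(pad_to, dtype=bool)  # True = real token, False = padding
    packed[:total] = np.concatenate(seqs, axis=0)
    mask[:total] = True
    return packed, mask, lengths

# Two frames with different aspect ratios -- no resizing/cropping needed.
a = np.random.rand(4, 8, 3)  # 4x8 frame -> 8 patches
b = np.random.rand(6, 4, 3)  # 6x4 frame -> 6 patches
packed, mask, lengths = pack([a, b], patch=2, pad_to=20)
print(lengths, packed.shape, mask.sum())  # [8, 6] (20, 12) 14
```

The attention mask is then used so padding tokens (and, in the full NaViT recipe, tokens from different images) don't attend to each other.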
Could I clarify whether you are discussing the removal of the entire RoPE layer here, or the comparison between the learnable RoPE and RoPE during training mentioned in Figure 4(b) of the paper?
Nope, we won't be removing the RoPE layer. Only the learnable PEs will be removed, since their weights are very close to zero and inference without that layer produces almost exactly the same results. The real motivation is that the learnable PEs are of fixed size, so we can't do multi-resolution training with them. Ideally, it would be good to support up to 2048px.
In the near future, we could also try a NaViT-like approach (as used in MovieGen), with or without RoPE.
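The reason RoPE doesn't have the fixed-size problem is that it's computed on the fly from the position index rather than looked up in a learned table. A minimal 1-D numpy sketch (illustrative only; CogVideoX uses a 3D variant over frame/height/width):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, dim), dim even.

    Angles are computed from the position index, so the same function works
    for any sequence length -- unlike a learned PE table whose size is fixed
    at training time."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))  # (half,)
    angles = np.outer(np.arange(seq_len), freqs)      # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

# The same function handles short and long sequences; a learned embedding
# table trained at one resolution could not.
short = rope(np.random.rand(16, 8))
long_ = rope(np.random.rand(4096, 8))
print(short.shape, long_.shape)  # (16, 8) (4096, 8)
```

Since each pair is only rotated, token norms are preserved and position 0 is left unchanged, which makes this easy to sanity-check.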
Since we're on board with the plan of training without the learned PEs, I'll go ahead and make a PR that:
If you take a look at the weights of the learned positional embedding in THUDM/CogVideoX-5b-I2V, you will find that the mean is close to 0 and the standard deviation is very low. That is, the weights are effectively 0: the model learned during training that those values are not very helpful. Since we also use RoPE, there is an alternative source of position information.
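The check itself is just summary statistics on the embedding tensor. A sketch with a stand-in array (the real tensor would be loaded from the THUDM/CogVideoX-5b-I2V checkpoint via diffusers; the tolerances and the near-zero values below are made up for illustration):

```python
import numpy as np

def effectively_zero(weights, mean_tol=1e-3, std_tol=1e-2):
    """Heuristic check that a learned PE tensor carries almost no signal."""
    return abs(float(weights.mean())) < mean_tol and float(weights.std()) < std_tol

# Stand-in for the learned positional embedding from the checkpoint.
# Values are synthetic near-zero noise mimicking the observed statistics.
np.random.seed(0)
fake_pe = np.random.normal(loc=0.0, scale=1e-4, size=(1, 226, 64))

print(fake_pe.mean(), fake_pe.std())
print(effectively_zero(fake_pe))  # True
```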
If you remove the learned positional embedding layer and run inference, you will still get good videos that are numerically similar (very low absmax difference) to the ones generated with that layer. However, the learned PEs are of fixed size, so you can't do multi-resolution/multi-frame training, which is quite limiting. Since my testing shows it has little to no effect, I'm considering copying the implementation from diffusers and allowing training without the learned PEs. I think it would be quite beneficial to be able to generate high-quality videos at multiple resolutions with the I2V model. WDYT?
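To make the "very low absmax difference" claim concrete, here is the shape of that comparison with a toy stand-in for the model (the real test would run the full pipeline with and without the layer; here the "forward pass" is just adding the PE to the tokens, and all values are synthetic):

```python
import numpy as np

def absmax_diff(a, b):
    """Largest elementwise absolute difference between two outputs."""
    return float(np.abs(a - b).max())

rng = np.random.default_rng(0)
tokens = rng.normal(size=(226, 64))
learned_pe = rng.normal(scale=1e-4, size=(226, 64))  # near-zero, as observed

with_pe = tokens + learned_pe  # layer present
without_pe = tokens            # layer removed

# Because the learned PE is effectively zero, the two outputs are
# numerically almost identical (absmax on the order of 1e-4 here).
print(absmax_diff(with_pe, without_pe))
```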
@sayakpaul @zRzRzRzRzRzRzR @glide-the
cc @G-U-N too in case you've tried rectified diffusion on it