a-r-r-o-w / cogvideox-factory

Memory optimized finetuning scripts for CogVideoX using TorchAO and DeepSpeed
Apache License 2.0

Allowing training without the learned positional embeddings in CogVideoX-I2V #26

Open a-r-r-o-w opened 1 day ago

a-r-r-o-w commented 1 day ago

If you take a look at the weights of the learned positional embedding in THUDM/CogVideoX-5b-I2V, you will find that the mean is close to 0 and the standard deviation is very low. In other words, the weights are effectively zero: the model learned during training that those values are not very helpful. Since we also use RoPE, there is an alternative source of positional information.
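For reference, here is a minimal sketch of how one could inspect those statistics. The `patch_embed.pos_embedding` attribute path is an assumption about where diffusers stores the learned PE:

```python
import torch
from diffusers import CogVideoXTransformer3DModel

# Load only the transformer to inspect the learned positional embedding.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Assumption: diffusers keeps the learned PE on the patch embedding module
# under `pos_embedding`; adjust the path if your diffusers version differs.
pe = transformer.patch_embed.pos_embedding.float()

print(f"shape:  {tuple(pe.shape)}")
print(f"mean:   {pe.mean().item():.3e}")
print(f"std:    {pe.std().item():.3e}")
print(f"absmax: {pe.abs().max().item():.3e}")
```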

If you remove the learned positional embedding layer and run inference, you still get good videos that are numerically similar (very low absmax difference) to the ones generated with that layer. However, the learned PEs have a fixed size, so you can't do multi-resolution/multi-frame training, which is quite limiting. Since, from my testing, the layer has little to no effect, I'm considering copying the implementation from diffusers and allowing training without the learned PEs. I think it would be quite beneficial to be able to generate high-quality videos at multiple resolutions with the I2V model. WDYT?
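As a rough way to reproduce that comparison, a sketch along these lines could work. The image path and prompt are placeholders, and the attribute path is the same assumption as above:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

# Placeholder conditioning inputs for this sketch.
image = load_image("input.png")
prompt = "a placeholder prompt"

def generate_latents(seed: int = 42) -> torch.Tensor:
    generator = torch.Generator(device="cuda").manual_seed(seed)
    # output_type="latent" skips VAE decoding so we can diff raw latents.
    return pipe(
        image=image,
        prompt=prompt,
        num_inference_steps=10,
        generator=generator,
        output_type="latent",
    ).frames

latents_with_pe = generate_latents()

# Zero the learned PE in-place (same assumed attribute path as above) and rerun.
with torch.no_grad():
    pipe.transformer.patch_embed.pos_embedding.zero_()
latents_without_pe = generate_latents()

print("absmax diff:", (latents_with_pe - latents_without_pe).abs().max().item())
```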

@sayakpaul @zRzRzRzRzRzRzR @glide-the

cc @G-U-N too in case you've tried rectified diffusion on it

zRzRzRzRzRzRzR commented 1 day ago

Yes, this part (the learnable positional embedding you mentioned) can be removed; we have verified that the effect is almost the same.

sayakpaul commented 1 day ago

Interesting conversation here!

We could maybe also try something like NaViT, which would allow us to take the frames in their native aspect ratio.

a-r-r-o-w commented 1 day ago

The MovieGen paper did something similar too, I think. They used NaViT-style PEs but no RoPE.
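For illustration, here is a minimal sketch of the NaViT-style packing idea (all helper names are hypothetical): tokens from differently shaped samples are concatenated into one sequence, with a block-diagonal mask so attention stays within each sample.

```python
import torch

def patchify(latent: torch.Tensor, p: int = 2) -> torch.Tensor:
    """Flatten a (C, F, H, W) latent into (num_tokens, C * p * p) patch tokens."""
    c, f, h, w = latent.shape
    latent = latent.reshape(c, f, h // p, p, w // p, p)
    latent = latent.permute(1, 2, 4, 0, 3, 5)
    return latent.reshape(f * (h // p) * (w // p), c * p * p)

def pack(samples: list[torch.Tensor]) -> tuple[torch.Tensor, torch.Tensor]:
    """Concatenate variable-length token sequences into one packed sequence and
    build a block-diagonal boolean mask so tokens only attend within their sample."""
    tokens = torch.cat(samples, dim=0)
    ids = torch.cat([torch.full((s.shape[0],), i) for i, s in enumerate(samples)])
    attn_mask = ids[:, None] == ids[None, :]
    return tokens, attn_mask

# Two latents with different aspect ratios packed into a single sequence.
wide = patchify(torch.randn(16, 2, 8, 12))
tall = patchify(torch.randn(16, 2, 12, 8))
tokens, mask = pack([wide, tall])
print(tokens.shape, mask.shape)
```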

lclichen commented 17 hours ago

Could I clarify whether you are discussing the removal of the entire RoPE layer here, or the comparison between the learnable PEs and RoPE during training shown in Figure 4(b) of the paper?

a-r-r-o-w commented 6 hours ago

Nope, we will not be removing the RoPE layer yet. Just the learnable ones will be removed, since they are very close to zero and inference without that layer produces almost exactly the same results. Another reason is that, with the learnable PEs, we can't do multi-resolution training because they are of fixed size. Ideally, it would be good to support up to 2048px.
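For context, RoPE frequencies can be recomputed for any latent grid size at runtime, which is what makes multi-resolution feasible once the fixed-size learned PEs are gone. A minimal sketch using diffusers' helper, where `embed_dim=64` assumes the attention head dim of the 5b checkpoint:

```python
from diffusers.models.embeddings import get_3d_rotary_pos_embed

# Illustrative sizes for a 49-frame, 480x720 generation:
# spatial grid = (480 / 8 / 2, 720 / 8 / 2) = (30, 45) patches
# (VAE spatial downscale 8, patch size 2),
# latent frames = (49 - 1) / 4 + 1 = 13 (temporal compression 4).
grid_height, grid_width, temporal_size = 30, 45, 13

freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
    embed_dim=64,
    crops_coords=((0, 0), (grid_height, grid_width)),
    grid_size=(grid_height, grid_width),
    temporal_size=temporal_size,
)
# One row of frequencies per patch token; change the grid sizes and the
# embeddings adapt, with no fixed-size learned table involved.
print(freqs_cos.shape, freqs_sin.shape)
```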

In the near future, we could also try a NaViT-like approach (as used in MovieGen), with or without RoPE.

a-r-r-o-w commented 6 hours ago

Since we are on board with the plan to train without the learned PEs, I'll go ahead and make a PR that: