a-r-r-o-w opened 1 month ago
Yes, this part (the learnable positional embedding, as you said) can be removed; we have verified that the effect is almost the same.
Interesting conversation here!
We could maybe also try something like NaViT, which would allow us to take frames in their native aspect ratio.
The MovieGen paper did something similar too, I think. They used NaViT PEs but no RoPE.
Could I clarify whether you are discussing the removal of the entire RoPE layer here, or the comparison between the learnable positional embedding and RoPE during training shown in Figure 4(b) of the paper?
Nope, we will not be removing the RoPE layer yet. Only the learnable positional embeddings will be removed, since their values are very close to zero and performing inference without that layer produces almost exactly the same results. The motivation is that, with the learnable PEs, we can't do multi-resolution training because they are of fixed size. Ideally, it would be good to support up to 2048px.
In the near future, we could also try a NaViT-like approach (as used in MovieGen), with or without RoPE.
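To illustrate why the fixed size is the blocker, here is a toy sketch (not the actual CogVideoX code): a learned PE table is allocated for a fixed number of patch tokens, so changing the resolution changes the token count and the addition no longer lines up, whereas RoPE is computed from the actual grid size.

```python
import torch
import torch.nn as nn

# Toy illustration only: a learned positional embedding table created for a
# fixed number of patch tokens (e.g. a 16x16 latent grid).
embed_dim, base_tokens = 64, 16 * 16
pos_embedding = nn.Parameter(torch.zeros(1, base_tokens, embed_dim))

tokens_base = torch.randn(1, 16 * 16, embed_dim)   # matches the table size
tokens_higher = torch.randn(1, 24 * 24, embed_dim)  # more patches at a higher resolution

print((tokens_base + pos_embedding).shape)  # works: token count matches the table

try:
    tokens_higher + pos_embedding
except RuntimeError as e:
    # Fails: the fixed-size table cannot cover the extra tokens.
    print("shape mismatch:", e)
```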
Since we're on board with the plan of trying to train without the learned PEs, I'll go ahead and make a PR that:
@a-r-r-o-w When training at multiple resolutions, it raises an error here: https://github.com/huggingface/diffusers/blob/5956b68a6927126daffc2c5a6d1a9a189defe288/src/diffusers/models/embeddings.py#L422. Maybe the "patch_embed.pos_embedding": "diffusion_pytorch_model-00001-of-00003.safetensors" entry has to be removed from the checkpoint.
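For reference, a minimal sketch of what that could look like, assuming the sharded checkpoint follows diffusers' usual diffusion_pytorch_model.safetensors.index.json layout (the local path below is hypothetical). The transformer config's learned-PE flag (use_learned_positional_embeddings in recent diffusers versions) would also need to be disabled so the model no longer expects that weight:

```python
import json

# Hypothetical local path to the downloaded transformer folder of THUDM/CogVideoX-5b-I2V.
index_path = "CogVideoX-5b-I2V/transformer/diffusion_pytorch_model.safetensors.index.json"

with open(index_path) as f:
    index = json.load(f)

# Drop the learned positional embedding from the shard index so it is no longer loaded.
index["weight_map"].pop("patch_embed.pos_embedding", None)

with open(index_path, "w") as f:
    json.dump(index, f, indent=2)
```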
Yes, I'll push the changes for supporting training without learned positional embeddings by tomorrow. I have an ongoing experiment running to verify that it works.
This paper might help too: Restructuring Vector Quantization with the Rotation Trick (https://arxiv.org/abs/2410.06424).
@a-r-r-o-w May I ask whether you can successfully support other resolutions with full-parameter SFT of I2V after training without the learned positional embeddings?
@trouble-maker007 I did a couple of experiments over the past two weeks and have had good indications that removing the learned PEs and performing multi-resolution training works well. I shared some results in #31 just now: in just about 1000 training steps at batch_size=4, the model starts to have a better understanding of different resolutions. I think a longer training run and higher quality data would be required for a good multi-resolution model, which we plan to work on in the near future.
Additionally, all scripts currently use CogVideoXDPMSolverMultistepScheduler. This tends to produce poorer results due to the stochasticity introduced by random noise. I would recommend switching to CogVideoXDDIMScheduler for any future experiments you might be planning.
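For example, swapping the scheduler on a loaded pipeline can be done with the usual from_config pattern (a minimal sketch using the T2V checkpoint):

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Reuse the existing scheduler config so the beta/timestep settings stay consistent,
# and only the sampler changes to DDIM.
pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
```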
@a-r-r-o-w Thanks for the detailed reply. I also found that using CogVideoXDDIMScheduler leads to better training results. Another question: is the current limitation of I2V's multi-resolution full-parameter SFT due to GPU memory issues?
The reason for not providing full-parameter I2V finetuning scripts is that I haven't found time to validate correctness on a bigger training run yet. We will be doing some bigger training runs soon, so as soon as I've verified correctness, I'll add a script here. Memory issues are not a problem so far, and you can fit a decent batch size (the memory plots for T2V SFT should roughly match the requirements of I2V).
If you take a look at the weights of the learned positional embedding in THUDM/CogVideoX-5b-I2V, you will find that the mean is close to 0 and the standard deviation is very low. That is to say, the weights are effectively zero, and the model learned during training that those values are not very helpful. Since we also use RoPE, there is an alternative source of position information.
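For instance, a quick way to check this yourself (a sketch assuming the learned PE is exposed as patch_embed.pos_embedding, the same key that appears in the checkpoint's weight map):

```python
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Inspect the learned positional embedding statistics.
pos_embedding = transformer.patch_embed.pos_embedding.float()
print("mean:  ", pos_embedding.mean().item())
print("std:   ", pos_embedding.std().item())
print("absmax:", pos_embedding.abs().max().item())
```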
If you remove the learned positional embedding layer and run inference, you will still get good videos that are numerically similar (very low absmax difference) to the ones you generate with that layer. However, the learned PEs are of fixed size, so you can't do multi-resolution/multi-frame training, which is quite limiting. Since, from my testing, I've found that the layer has little-to-no effect, I'm considering copying the implementations from diffusers and allowing training without the learned PEs. I think it would be quite beneficial if one could generate high-quality videos at multiple resolutions with the I2V model. WDYT?
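As a rough way to reproduce that comparison, here is a sketch that zeroes the learned PE in place (which should mimic removing the layer, since it is only added to the patch embeddings) and compares raw latents with a fixed seed; the conditioning image and prompt are placeholders:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("first_frame.png")   # hypothetical conditioning image
prompt = "a short descriptive prompt"   # hypothetical prompt

def run(seed: int = 42):
    # Return raw latents so the comparison is not affected by VAE decoding.
    return pipe(
        prompt=prompt,
        image=image,
        num_inference_steps=10,
        generator=torch.Generator(device="cuda").manual_seed(seed),
        output_type="latent",
    ).frames

latents_with_pe = run()

# Zero the learned positional embedding in place to mimic removing the layer.
with torch.no_grad():
    pipe.transformer.patch_embed.pos_embedding.zero_()

latents_without_pe = run()
print("absmax diff:", (latents_with_pe.float() - latents_without_pe.float()).abs().max().item())
```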
@sayakpaul @zRzRzRzRzRzRzR @glide-the
cc @G-U-N too in case you've tried rectified diffusion on it