a-r-r-o-w opened 1 month ago
Yes, this part (the learnable positional embedding, as you said) can be removed; we have verified that the effect is almost the same.
Interesting conversation here!
We could maybe also try something like NaViT, which would allow us to take frames in their native aspect ratio.
The MovieGen paper did something similar too, I think. They used NaViT PEs but no RoPE.
Could I clarify whether you are discussing the removal of the entire RoPE layer here, or the comparison between the learnable positional embedding and RoPE during training shown in Figure 4(b) of the paper?
Nope, we will not be removing the RoPE layer yet. Only the learnable positional embeddings will be removed, since their values are very close to zero and performing inference without that layer produces almost exactly the same results. The motivation is that, with the learnable PEs, we can't do multi-resolution training because they are of fixed size. Ideally, it would be good to support up to 2048px.
In the near future, we could also try a NaViT-like approach (as used in MovieGen), with or without RoPE.
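To illustrate why the fixed size is the blocker, here is a toy sketch (not the actual CogVideoX code): a learned PE table is allocated for a fixed number of patch tokens, so changing the resolution changes the token count and the addition no longer lines up, whereas RoPE is computed from the actual grid size.

```python
import torch
import torch.nn as nn

# Toy illustration only: a learned positional embedding table created for a
# fixed number of patch tokens (e.g. a 16x16 latent grid).
embed_dim, base_tokens = 64, 16 * 16
pos_embedding = nn.Parameter(torch.zeros(1, base_tokens, embed_dim))

tokens_base = torch.randn(1, 16 * 16, embed_dim)   # matches the table size
tokens_higher = torch.randn(1, 24 * 24, embed_dim)  # more patches at a higher resolution

print((tokens_base + pos_embedding).shape)  # works: token count matches the table

try:
    tokens_higher + pos_embedding
except RuntimeError as e:
    # Fails: the fixed-size table cannot cover the extra tokens.
    print("shape mismatch:", e)
```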
Since we're on board with the plan of trying to train without the learned PEs, I'll go ahead and make a PR that:
@a-r-r-o-w When training at multiple resolutions, it raises an error here: https://github.com/huggingface/diffusers/blob/5956b68a6927126daffc2c5a6d1a9a189defe288/src/diffusers/models/embeddings.py#L422. Maybe the "patch_embed.pos_embedding": "diffusion_pytorch_model-00001-of-00003.safetensors" entry has to be removed from the checkpoint.
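For reference, a minimal sketch of what that could look like, assuming the sharded checkpoint follows diffusers' usual diffusion_pytorch_model.safetensors.index.json layout (the local path below is hypothetical). The transformer config's learned-PE flag (use_learned_positional_embeddings in recent diffusers versions) would also need to be disabled so the model no longer expects that weight:

```python
import json

# Hypothetical local path to the downloaded transformer folder of THUDM/CogVideoX-5b-I2V.
index_path = "CogVideoX-5b-I2V/transformer/diffusion_pytorch_model.safetensors.index.json"

with open(index_path) as f:
    index = json.load(f)

# Drop the learned positional embedding from the shard index so it is no longer loaded.
index["weight_map"].pop("patch_embed.pos_embedding", None)

with open(index_path, "w") as f:
    json.dump(index, f, indent=2)
```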
Yes, I'll push the changes for supporting training without learned positional embeddings by tomorrow. I have an ongoing experiment running to verify that it works.
This paper might help too: Restructuring Vector Quantization with the Rotation Trick (https://arxiv.org/abs/2410.06424).
@a-r-r-o-w May I ask whether you can successfully support other resolutions with full-parameter SFT of I2V after training without the learned positional embeddings?
@trouble-maker007 I did a couple of experiments over the past two weeks and have had good indications that removing the learned PEs and performing multi-resolution training works well. I shared some results in #31 just now: in just about 1000 training steps at batch_size=4, the model starts to have a better understanding of different resolutions. I think a longer training run and higher quality data would be required for a good multi-resolution model, which we plan to work on in the near future.
Additionally, all scripts currently use CogVideoXDPMSolverMultistepScheduler. This tends to produce poorer results due to the stochasticity introduced by random noise. I would recommend switching to CogVideoXDDIMScheduler for any future experiments you might be planning.
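For example, swapping the scheduler on a loaded pipeline can be done with the usual from_config pattern (a minimal sketch using the T2V checkpoint):

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Reuse the existing scheduler config so the beta/timestep settings stay consistent,
# and only the sampler changes to DDIM.
pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
```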
@a-r-r-o-w Thanks for the detailed reply. I also found that using CogVideoXDDIMScheduler leads to better training results. Another question: is the current limitation of I2V's multi-resolution full-parameter SFT due to GPU memory issues?
The reason for not providing full-parameter I2V finetuning scripts is that I haven't found time to validate correctness on a bigger training run yet. We will be doing some bigger training runs soon, so as soon as I've verified correctness, I'll add a script here. Memory issues are not a problem so far, and you can fit a decent batch size (the memory plots for T2V SFT should roughly match the requirements of I2V).
If you take a look at the weights of the learned positional embedding in THUDM/CogVideoX-5b-I2V, you will find that the mean is close to 0 and the standard deviation is very low. That is to say, the weights are effectively zero, and the model learned during training that those values are not very helpful. Since we also use RoPE, there is an alternative source of position information.
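For instance, a quick way to check this yourself (a sketch assuming the learned PE is exposed as patch_embed.pos_embedding, the same key that appears in the checkpoint's weight map):

```python
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Inspect the learned positional embedding statistics.
pos_embedding = transformer.patch_embed.pos_embedding.float()
print("mean:  ", pos_embedding.mean().item())
print("std:   ", pos_embedding.std().item())
print("absmax:", pos_embedding.abs().max().item())
```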
If you remove the learned positional embedding layer and run inference, you will still get good videos that are numerically similar (very low absmax difference) to the ones you generate with that layer. However, the learned PEs are of fixed size, so you can't do multi-resolution/multi-frame training, which is quite limiting. Since, from my testing, I've found that the layer has little-to-no effect, I'm considering copying the implementations from diffusers and allowing training without the learned PEs. I think it would be quite beneficial if one could generate high-quality videos at multiple resolutions with the I2V model. WDYT?
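As a rough way to reproduce that comparison, here is a sketch that zeroes the learned PE in place (which should mimic removing the layer, since it is only added to the patch embeddings) and compares raw latents with a fixed seed; the conditioning image and prompt are placeholders:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("first_frame.png")   # hypothetical conditioning image
prompt = "a short descriptive prompt"   # hypothetical prompt

def run(seed: int = 42):
    # Return raw latents so the comparison is not affected by VAE decoding.
    return pipe(
        prompt=prompt,
        image=image,
        num_inference_steps=10,
        generator=torch.Generator(device="cuda").manual_seed(seed),
        output_type="latent",
    ).frames

latents_with_pe = run()

# Zero the learned positional embedding in place to mimic removing the layer.
with torch.no_grad():
    pipe.transformer.patch_embed.pos_embedding.zero_()

latents_without_pe = run()
print("absmax diff:", (latents_with_pe.float() - latents_without_pe.float()).abs().max().item())
```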
@sayakpaul @zRzRzRzRzRzRzR @glide-the
cc @G-U-N too in case you've tried rectified diffusion on it