mrartemevmorphic opened 2 months ago
Hello! Thank you for your great paper and for publishing the code and checkpoints for the t2v models. While reading about how it all works, I had a number of questions. I hope you'll find some time to answer at least some of them. Feel free to direct me to your paper if it is already explained there :)
- What is the reasoning behind using a patch size as small as 2? I usually see patch sizes of 16 or 8, especially when generating 512x512 images.
- I see that you used LoRACompatible modules for the linear projections. Have you thought about how this architecture could be extended with LoRAs?
- Have you thought about adding some image-specific positional encoding to appended images?
- What is the purpose of args.fixed_spatial? In what cases would one want to train only spatial layers?
- In the provided training script, the decay for EMA is set to 0. Does that mean that the provided checkpoint was trained without EMA? Link: here
- Given that you are already passing "scaling_factor": 0.18215 to the VAE model, why do you scale it again in the training loop? Link: here
- Given that you are already doing attention masking in the encode_prompt function, why are you passing the attention_mask and encoder_attention_mask arguments to the model's forward method? I may be missing something, but it seems that neither of these arguments is ever used.
- How do you switch between using fp16 and fp32 in the training script?
- Training the model on more than 16 frames often results in checkerboard artifacts and significantly reduced quality. Do you think this is a limitation of the Latte model's architecture? I've seen that you recommend looking into autoregressive video modeling, but still, how can we effectively scale the number of generated frames from 16 to 32 without changing the architecture or the sampling method?
- In the implementation of the BasicTransformerBlock, there is a lot of commented-out code with the cross-attention implementation. Does this mean that the pretrained checkpoint was trained without it?
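To make the patch-size question concrete, here is a small sketch of how the transformer's token count scales with patch size. It assumes an 8x-downsampling VAE (as in Stable Diffusion) and a DiT-style patchifier over the latent grid; the function name is illustrative, not from the repo:

```python
# Assumption: 512x512 pixels, 8x VAE downsample -> 64x64 latent grid,
# then ViT/DiT-style patchifying into (latent_size / patch_size)^2 tokens.
def num_tokens(image_size: int, vae_downsample: int, patch_size: int) -> int:
    """Number of transformer tokens after VAE encoding and patchifying."""
    latent_size = image_size // vae_downsample  # 512 // 8 = 64
    side = latent_size // patch_size
    return side * side

# patch_size=2  -> 32*32 = 1024 tokens (fine-grained, costly attention)
# patch_size=8  ->   8*8 =   64 tokens
# patch_size=16 ->   4*4 =   16 tokens
for p in (2, 8, 16):
    print(p, num_tokens(512, 8, p))
```

The quadratic blow-up in tokens as patch size shrinks is presumably why small patches are affordable in latent space but not in pixel space, where patch 16 or 8 is the norm.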
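On the EMA question, a minimal sketch of the standard EMA update makes clear what decay = 0 would imply; this is illustrative code, not the repo's actual implementation:

```python
# Standard EMA update: ema <- decay * ema + (1 - decay) * model.
def ema_update(ema_w: float, model_w: float, decay: float) -> float:
    return decay * ema_w + (1.0 - decay) * model_w

# With decay = 0 the EMA weight is overwritten by the current model
# weight at every step, so the "EMA" checkpoint is just the raw model.
assert ema_update(ema_w=0.5, model_w=1.0, decay=0.0) == 1.0
# With a typical decay (e.g. 0.9999) the EMA barely moves per step.
```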
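The double-scaling question can be illustrated with toy numbers. In diffusers, scaling_factor lives in the VAE config but, to my understanding, encode() itself returns unscaled latents and callers multiply manually; applying the factor both ways would shrink latents by 0.18215 squared. No real VAE is loaded here and encode is a stand-in:

```python
SCALING_FACTOR = 0.18215  # value passed in the VAE config

def encode(pixel_value: float) -> float:
    """Stand-in for vae.encode(...): returns an *unscaled* latent."""
    return pixel_value  # identity, for illustration only

latent = encode(1.0)
once = latent * SCALING_FACTOR   # intended single scaling
twice = once * SCALING_FACTOR    # accidental double scaling: 0.18215**2
```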
Thank you again for your work, and I look forward to your answers!
Hi, thanks for your interest.