YangLing0818 / ContextDiff

[ICLR 2024] Contextualized Diffusion Models for Text-Guided Image and Video Generation
https://openreview.net/forum?id=nFMS6wF2xq

mean.pth and std.pth #4

Closed · zhw0516 closed this issue 5 months ago

zhw0516 commented 5 months ago

Hello! Thanks for the amazing work! In the text-to-image part, I want to know how the mean.pth and std.pth files are generated. Looking forward to your reply!

BitCodingWalkin commented 5 months ago

Thank you for your attention to our work. Regarding your question: mean.pth and std.pth store the mean and standard deviation of the embedding distribution for the corresponding dataset. In our implementation, you can directly create mean and std tensors of shape (1024, 1536) corresponding to an isotropic Gaussian distribution using PyTorch, and save them as the initial mean and std. Alternatively, you can extract the dataset's embeddings with CLIP and use a clustering method such as a GMM to estimate the dataset-specific mean and standard deviation. Our experiments validated that using dataset-specific statistics accelerates training convergence.
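
A minimal sketch of both options, assuming the (1024, 1536) shape from the reply above. Option 1 follows the isotropic-Gaussian initialization directly; Option 2 is a hypothetical stand-in for the CLIP-plus-GMM route, where encode_text and captions are placeholders for your own CLIP text encoder and caption list (a single-component diagonal-covariance GMM reduces to the empirical per-dimension mean and variance, so those are computed directly):

```python
import torch

# Option 1: initial mean/std for a standard isotropic Gaussian,
# saved under the file names used by the repository.
mean = torch.zeros(1024, 1536)
std = torch.ones(1024, 1536)
torch.save(mean, "mean.pth")
torch.save(std, "std.pth")

# Option 2 (hypothetical sketch): dataset-specific statistics from CLIP
# embeddings. encode_text(caption) is assumed to return a (1024, 1536)
# tensor for one caption; captions is your dataset's caption list.
embs = torch.stack([encode_text(c) for c in captions])  # (N, 1024, 1536)
mean = embs.mean(dim=0)  # per-dimension mean, shape (1024, 1536)
std = embs.std(dim=0)    # per-dimension std, shape (1024, 1536)
torch.save(mean, "mean.pth")
torch.save(std, "std.pth")
```

If you want a true multi-component GMM as mentioned above, fitting sklearn.mixture.GaussianMixture on the flattened embeddings is one option, though the memory cost grows quickly at this dimensionality.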

zhw0516 commented 5 months ago

Thanks for your reply! I have another question: the context-aware adapter needs to be trained for the text-to-image generation task, but why is it not trained for the text-to-video editing task? Is it okay if I use it for a text-to-image editing task?

YangLing0818 commented 5 months ago

Our ContextDiff is a general diffusion method, as demonstrated in our paper (e.g., text/class/layout-to-image generation, text-to-video editing). Therefore, you can try it with any text-conditional visual generation and editing task.

jupytera commented 1 month ago

Hello @zhw0516, if you have successfully used the text-guided image generation function, could you share your specific usage steps? The trained checkpoint produced by train_adapter.py does not seem to be used by finetune_diffusion.py. Did I make a mistake somewhere? Thank you for your help!