Closed zhw0516 closed 5 months ago
Thank you for your attention to our work. As for your question: mean.pth and std.pth store the mean and standard deviation of the embedding distribution for the corresponding dataset. In our implementation, you can directly create mean and std tensors for an isotropic Gaussian distribution with shape (1024, 1536) using PyTorch and save them as the initial mean and std. Alternatively, you can extract the dataset's embeddings with CLIP and use a clustering method such as GMM to obtain the dataset-specific mean and standard deviation. Our experiments have validated that using the dataset-specific statistics accelerates training convergence.
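For anyone else hitting this, here is a minimal sketch of the first option the authors describe: initializing mean.pth and std.pth as an isotropic Gaussian with the stated (1024, 1536) shape. The filenames and shape come from this thread; everything else (zero mean, unit std) is an assumption for the isotropic case.

```python
import torch

# Isotropic Gaussian initialization (assumption: zero mean, unit std).
# Shape (1024, 1536) is the embedding shape mentioned by the authors.
mean = torch.zeros(1024, 1536)  # mean of the embedding distribution
std = torch.ones(1024, 1536)    # standard deviation (not variance)

# Save in the filenames the repo expects.
torch.save(mean, "mean.pth")
torch.save(std, "std.pth")
```

For the dataset-specific alternative, you would instead encode the dataset with CLIP and fit the statistics (e.g. via a GMM) over those embeddings before saving.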
Thanks for your reply! I have another question: the context-aware adapter needs to be trained for the text-to-image generation task, but why is it not trained for the text-to-video editing task? Is it okay if I use it for a text-to-image editing task?
Our ContextDiff is a general diffusion method, as demonstrated in our paper (e.g., text/class/layout-to-image generation, text-to-video editing). Therefore, you can try it with any text-conditional visual generation and editing task.
Hello! Thanks for the amazing work! In the text-to-image part, I want to know how the mean.pth and std.pth files are generated. Looking forward to your reply!
Hello, if you have successfully run the text-guided image generation pipeline, could you share your exact usage steps? The adapter checkpoint produced by train_adapter.py does not seem to be picked up by finetune_diffusion.py. Did I make a mistake somewhere? Thank you for your help!