Open hongsukchoi opened 10 months ago
Yes, you're right that there is no explicit alignment in ControlNet. What it does is just encode the geometric control input (e.g. the scribble) into features and add them to the SD intermediate features.
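Roughly, the wiring looks like this. This is just a sketch with diffusers-style names, not this repo's code; the model ids, shapes, and the random stand-in tensors below are only examples. The control branch returns one residual per UNet resolution plus a mid-block residual, and those are added element-wise to the UNet features, while the text embeddings only enter through cross-attention, so nothing spatially aligns text with the control signal:

```python
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

# Example model ids (assumptions, any SD-1.5-compatible ControlNet would do).
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

latents = torch.randn(1, 4, 64, 64)         # SD latent
text_embeddings = torch.randn(1, 77, 768)   # stand-in for the CLIP text-encoder output
scribble = torch.rand(1, 3, 512, 512)       # geometric control image
t = torch.tensor([10])

# The control branch produces per-resolution residuals plus a mid-block residual.
down_res, mid_res = controlnet(
    latents, t, encoder_hidden_states=text_embeddings,
    controlnet_cond=scribble, return_dict=False,
)

# The residuals are simply *added* to the UNet's intermediate features; the text
# embeddings only interact through cross-attention, so there is no explicit
# text-to-control alignment step anywhere.
noise_pred = unet(
    latents, t, encoder_hidden_states=text_embeddings,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample
```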
Hi @lwchen6309, by the way, I have another question. In your test code in runner_inpait.py, the input prompt is: "A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed." I printed the value of the color cross_attention_weight_64 corresponding to the token "aurora", and it looks like this:
[0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
[0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]
So we guess the aurora's cross-attention location will be near the upper right, and the same holds for the token "full moon". But why do we also need to put another mask image file that marks the moon's real position into the latent space, like
latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
Won't this duplicate the effect of the previous color cross_attention_weight? PTAL, thank you!
Hi, I think the image_mask just specifies the region for inpainting. The object segmentation is still controlled by the cross-attention weight, so the two are not redundant.
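To make the distinction concrete, here is a minimal sketch. The variable and function names are mine, not the repo's, and I'm assuming the weight map is added to the pre-softmax cross-attention scores, which is the usual way this kind of attention editing is done. The mask only becomes extra input channels telling the UNet which region to repaint, while the per-token weight map biases where that token attends:

```python
import torch

# (a) Inpainting conditioning: mask + masked-image latents become extra input
# channels, telling the UNet *which region* to regenerate.
latents = torch.randn(1, 4, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[:, :, :32, 32:] = 1.0                               # repaint the upper-right quadrant
masked_image_latents = torch.randn(1, 4, 64, 64)
unet_input = torch.cat([latents, mask, masked_image_latents], dim=1)  # 9-channel input

# (b) Attention editing: a spatial weight map for one token (e.g. "aurora") is
# added to that token's pre-softmax attention scores, biasing *where* it attends.
def biased_cross_attention(q, k, v, token_idx, weight_map):
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bid,bjd->bij", q, k) * scale   # (batch, pixels, tokens)
    scores[:, :, token_idx] = scores[:, :, token_idx] + weight_map.flatten()
    return torch.einsum("bij,bjd->bid", scores.softmax(dim=-1), v)

q = torch.randn(1, 8 * 8, 40)                            # 8x8 latent positions
k = torch.randn(1, 77, 40)                               # 77 text tokens
v = torch.randn(1, 77, 40)
aurora_map = torch.zeros(8, 8)
aurora_map[:4, 4:] = 0.5                                  # roughly like the printed weight above
out = biased_cross_attention(q, k, v, token_idx=5, weight_map=aurora_map)  # token index is illustrative
```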
Thank you for your great work!
I have a question about the ControlNet extension. It seems the text is spatially aligned with the latent embeddings originally from SD, but how is the spatial alignment between the text and the geometric control (e.g. scribble) done?
Reading through the code here, I think there is no alignment between the text embeddings and the geometric control embeddings. Am I right?
Thank you!