Open hongsukchoi opened 10 months ago
Yes, you're right that there is no explicit alignment in ControlNet. What it does is just encode the geometric control input (e.g. the scribble) into features and add them to the SD intermediate features.
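Roughly, the wiring looks like this. This is just a sketch with diffusers-style names, not this repo's code; the model ids, shapes, and the random stand-in tensors below are only examples. The control branch returns one residual per UNet resolution plus a mid-block residual, and those are added element-wise to the UNet features, while the text embeddings only enter through cross-attention, so nothing spatially aligns text with the control signal:

```python
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

# Example model ids (assumptions, any SD-1.5-compatible ControlNet would do).
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

latents = torch.randn(1, 4, 64, 64)         # SD latent
text_embeddings = torch.randn(1, 77, 768)   # stand-in for the CLIP text-encoder output
scribble = torch.rand(1, 3, 512, 512)       # geometric control image
t = torch.tensor([10])

# The control branch produces per-resolution residuals plus a mid-block residual.
down_res, mid_res = controlnet(
    latents, t, encoder_hidden_states=text_embeddings,
    controlnet_cond=scribble, return_dict=False,
)

# The residuals are simply *added* to the UNet's intermediate features; the text
# embeddings only interact through cross-attention, so there is no explicit
# text-to-control alignment step anywhere.
noise_pred = unet(
    latents, t, encoder_hidden_states=text_embeddings,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample
```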
Hi @lwchen6309, by the way, I have another question. In your test code in runner_inpait.py, the input prompt is: "A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed." I printed the value of the color cross_attention_weight_64 corresponding to the token "aurora", and it looks like this:
[0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
[0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]
So we guess the aurora's cross-attention location will be near the upper right, and the same holds for the token "full moon". But why do we also need to put another mask image file that marks the moon's real position into the latent space, like
latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
Won't this duplicate the effect of the previous color cross_attention_weight? PTAL, thank you!
Hi, I think the image_mask just specifies the region for inpainting. The object segmentation is still controlled by the cross-attention weight, so the two are not redundant.
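To make the distinction concrete, here is a minimal sketch. The variable and function names are mine, not the repo's, and I'm assuming the weight map is added to the pre-softmax cross-attention scores, which is the usual way this kind of attention editing is done. The mask only becomes extra input channels telling the UNet which region to repaint, while the per-token weight map biases where that token attends:

```python
import torch

# (a) Inpainting conditioning: mask + masked-image latents become extra input
# channels, telling the UNet *which region* to regenerate.
latents = torch.randn(1, 4, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[:, :, :32, 32:] = 1.0                               # repaint the upper-right quadrant
masked_image_latents = torch.randn(1, 4, 64, 64)
unet_input = torch.cat([latents, mask, masked_image_latents], dim=1)  # 9-channel input

# (b) Attention editing: a spatial weight map for one token (e.g. "aurora") is
# added to that token's pre-softmax attention scores, biasing *where* it attends.
def biased_cross_attention(q, k, v, token_idx, weight_map):
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bid,bjd->bij", q, k) * scale   # (batch, pixels, tokens)
    scores[:, :, token_idx] = scores[:, :, token_idx] + weight_map.flatten()
    return torch.einsum("bij,bjd->bid", scores.softmax(dim=-1), v)

q = torch.randn(1, 8 * 8, 40)                            # 8x8 latent positions
k = torch.randn(1, 77, 40)                               # 77 text tokens
v = torch.randn(1, 77, 40)
aurora_map = torch.zeros(8, 8)
aurora_map[:4, 4:] = 0.5                                  # roughly like the printed weight above
out = biased_cross_attention(q, k, v, token_idx=5, weight_map=aurora_map)  # token index is illustrative
```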
Thank you for your great work!
I have a question about the ControlNet extension. It seems the text is spatially aligned with the latent embeddings originally from SD, but how is the spatial alignment between the text and the geometric control (e.g. scribble) done?
Reading through the code here, I think there is no alignment between the text embeddings and the geometric control embeddings. Am I right?
Thank you!