CompVis / stable-diffusion

A latent text-to-image diffusion model
https://ommer-lab.com/research/latent-diffusion-models/

Question about the attention mask for the text embedding #779

Open Microbiods opened 11 months ago

Microbiods commented 11 months ago

Hi, I am new to SD and have a question. The text embedding extracted from CLIP has shape (bs, 77, 768). When I feed this embedding to the UNet to predict the noise, do I also need to pass the 'attention_mask' for the sentence? Or does the text embedding already carry the padding information from the CLIP text encoder, so that no attention_mask is needed? What does the official SD do?

Thanks a lot!
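
To make the question concrete, here is a minimal sketch of what passing or omitting a padding mask would change in a cross-attention step. It is a toy illustration only, not the official SD code, and it does not settle what the official implementation does. Assumptions: the context vectors are used directly as keys (no learned projections), the feature dimension is shrunk from 768 to 8, and `N_REAL`, `cross_attention_weights`, and the random data are hypothetical names and values made up for the demo.

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of floats (may contain -inf).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_weights(query, context, key_mask=None):
    """Attention weights of one query vector over the context tokens.

    key_mask[i] == 0 marks context token i as padding; None means no mask.
    Simplification: the context vectors act as keys without any projection.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in context]
    if key_mask is not None:
        # Masked positions get -inf so they receive zero weight after softmax.
        scores = [s if m else float("-inf") for s, m in zip(scores, key_mask)]
    return softmax(scores)


random.seed(0)
SEQ_LEN, DIM = 77, 8  # 77 matches the embedding length; DIM is shrunk from 768
N_REAL = 10           # hypothetical number of non-padding tokens in the prompt

# Stand-ins for one text embedding (77 x DIM) and one image-side query vector.
context = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(SEQ_LEN)]
query = [random.gauss(0, 1) for _ in range(DIM)]
attention_mask = [1] * N_REAL + [0] * (SEQ_LEN - N_REAL)

unmasked = cross_attention_weights(query, context)
masked = cross_attention_weights(query, context, attention_mask)

print("weight on padding positions, no mask:", sum(unmasked[N_REAL:]))
print("weight on padding positions, masked: ", sum(masked[N_REAL:]))
```

Without a mask, the padding positions still receive some attention weight, so whatever the text encoder wrote into those 67 positions influences the UNet. With a mask, they contribute nothing. The thing to check in the repository is therefore whether a mask is passed to the UNet's cross-attention layers, and whether the model was trained that way, since inference should match training.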