question about the attention mask of text embedding

Hi, I am new to Stable Diffusion (SD). The text embedding extracted from CLIP has shape (bs, 77, 768). When I pass this embedding to the UNet to predict noise, do I also need to pass the 'attention_mask' of the sentence? Or does the text embedding already carry the padding information from the CLIP text encoder, so that there is no need to pass the attention_mask? What does the official SD implementation do?
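To make the question concrete, here is a minimal sketch of what I mean by the 'attention_mask': a list of 0/1 flags marking which of the 77 positions hold real tokens and which are padding. This is a toy stand-in and not the real tokenizer. The helper `pad_with_mask` is hypothetical, the word ids are made up, and the special-token ids (49406 for start-of-text, 49407 for end-of-text, also reused as the pad id) are my assumption about the CLIP tokenizer.

```python
# Toy illustration only: not the real CLIP tokenizer.
BOS_ID = 49406   # assumed start-of-text id
EOS_ID = 49407   # assumed end-of-text id, also assumed to be the pad id
MAX_LENGTH = 77  # sequence length seen in the (bs, 77, 768) embedding


def pad_with_mask(word_ids, max_length=MAX_LENGTH, pad_id=EOS_ID):
    """Return (input_ids, attention_mask), both of length max_length."""
    # Truncate the words so that BOS and EOS still fit in max_length.
    ids = [BOS_ID] + list(word_ids)[: max_length - 2] + [EOS_ID]
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad
    # 1 marks a real token (including BOS/EOS), 0 marks padding.
    attention_mask = [1] * n_real + [0] * n_pad
    return input_ids, attention_mask


# Five made-up word ids standing in for a short prompt.
input_ids, attention_mask = pad_with_mask([320, 1125, 539, 320, 2368])
print(len(input_ids), sum(attention_mask))
```

So the question is whether the UNet's cross-attention should use such a mask to ignore the padded positions, or whether it attends to all 77 positions of the embedding.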

Thanks a lot!

CompVis / stable-diffusion

question about the attention mask of text embedding #779