Closed m-violet-s closed 10 months ago
If anyone knows, could you please let me know?
Hey, I ran into the same problem, but I also don't know how to solve it. If my memory serves me right, the FrozenCLIPTextEmbedder should work in theory, because I used to fine-tune the text encoder for Stable Diffusion.
Thank you. I found that the text feature vectors produced by these two classes have different sizes. After reading the code inside `self.model.encode_text(tokens)`, I changed the `encode` function inside the FrozenCLIPTextEmbedder class to:
```python
def encode(self, text):
    tokens = clip.tokenize(text).to(self.device)
    x = self.model.token_embedding(tokens).type(self.model.dtype)  # [batch_size, n_ctx, d_model]
    x = x + self.model.positional_embedding.type(self.model.dtype)
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.model.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD
    z = self.model.ln_final(x).type(self.model.dtype)  # [batch, 77, 768]
    return z
```
With this change, the generated image matches the text prompt. I wonder if you did the same thing?
Thanks for your solution. Yes, mine is similar to yours: I set `self.normalize` to False in the FrozenCLIPTextEmbedder, and I removed the `x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection` line in `clip/model.py`. You are right that the sequence length of the condition (i.e. the prompts) differs: (B, 1, 768) from the FrozenCLIPTextEmbedder versus (B, 77, 768) from the FrozenCLIPEmbedder. The feature from the FrozenCLIPTextEmbedder represents the whole sentence (hence length 1), while the feature from the FrozenCLIPEmbedder represents each token in the sentence (hence length 77).
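The shape difference can be sketched without loading CLIP at all. A minimal NumPy illustration (the 77 context length, 768 embedding dim, and the EOT-argmax pooling come from the discussion above; the random data and EOT positions are just stand-ins):

```python
import numpy as np

B, L, D = 2, 77, 768  # batch, CLIP context length, embedding dim

# Stand-in for the ln_final output: one vector per token.
# This is what the modified encode() above returns, shape (B, 77, 768).
per_token = np.random.randn(B, L, D)

# Stand-in for tokenized prompts; the EOT token has the highest id
# (49407 in CLIP's vocab), so argmax over the sequence finds its position.
tokens = np.zeros((B, L), dtype=np.int64)
tokens[0, 5] = 49407   # EOT at position 5 for sample 0
tokens[1, 9] = 49407   # EOT at position 9 for sample 1

# The pooling line removed from clip/model.py keeps only the EOT
# position, collapsing each prompt to a single sentence-level vector.
pooled = per_token[np.arange(B), tokens.argmax(axis=-1)]

print(per_token.shape)  # (2, 77, 768) - one feature per token
print(pooled.shape)     # (2, 768)     - one feature per sentence
```

So removing that pooling (and the `@ self.text_projection` that follows it) is exactly what keeps the per-token sequence the diffusion model's cross-attention expects.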
Hi, I am having the following problem when running
From the `cond_stage_config` in the v1-inference.yaml file, I understand that the text encoder is loaded from Hugging Face, and I can run txt2img.py normally with that encoder. But if I change `cond_stage_config` to call the text encoder from the official CLIP repository instead, the resulting image has nothing to do with the text.
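For reference, the swap being described is roughly this fragment of v1-inference.yaml (sketched from the CompVis stable-diffusion layout; surrounding keys and any `params` are omitted and may differ by version):

```yaml
model:
  params:
    cond_stage_config:
      # default: the Hugging Face text encoder that SD v1 was trained with
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
      # the failing variant: the encoder based on the official CLIP package
      # target: ldm.modules.encoders.modules.FrozenCLIPTextEmbedder
```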
Is this because Stable Diffusion was trained with the Hugging Face text encoder, so the text prompt has no effect when using FrozenCLIPTextEmbedder as the encoder?