CompVis / stable-diffusion

A latent text-to-image diffusion model
https://ommer-lab.com/research/latent-diffusion-models/

T5 instead of CLIP #35

Open betterze opened 2 years ago

betterze commented 2 years ago

Dear stable-diffusion team,

Thank you for sharing this great work. I really like it.

Have you considered using a pretrained T5 encoder instead of the pretrained CLIP text encoder? According to the Imagen paper, T5-XXL is better than CLIP:

> We also find that while T5-XXL and CLIP text encoders perform similarly on simple benchmarks such as MS-COCO, human evaluators prefer T5-XXL encoders over CLIP text encoders in both image-text alignment and image fidelity on DrawBench, a set of challenging and compositional prompts.
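For concreteness, here is a rough sketch of what a drop-in frozen T5 text encoder could look like, assuming the Hugging Face `transformers` `T5Tokenizer`/`T5EncoderModel` API. The class name, checkpoint, and max length below are only illustrative (not part of this repo), and the UNet's cross-attention `context_dim` would also need to change to match T5's `d_model`:

```python
import torch
import torch.nn as nn
from transformers import T5Tokenizer, T5EncoderModel


class FrozenT5Embedder(nn.Module):
    """Frozen T5 encoder producing per-token text features (sketch, not from this repo)."""

    def __init__(self, version="google/t5-v1_1-large", device="cpu", max_length=77):
        super().__init__()
        self.tokenizer = T5Tokenizer.from_pretrained(version)
        self.transformer = T5EncoderModel.from_pretrained(version).to(device)
        self.device = device
        self.max_length = max_length
        self.freeze()

    def freeze(self):
        self.transformer.eval()
        for param in self.transformer.parameters():
            param.requires_grad = False

    @torch.no_grad()
    def forward(self, text):
        batch = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,  # 77 only mirrors the CLIP setting; T5 is not limited to it
            padding="max_length",
            return_tensors="pt",
        )
        input_ids = batch["input_ids"].to(self.device)
        attention_mask = batch["attention_mask"].to(self.device)
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        # (batch, max_length, d_model): only the encoder stack's final hidden
        # states are used; the T5 decoder is never instantiated.
        return outputs.last_hidden_state

    def encode(self, text):
        return self(text)
```

Freezing the encoder mirrors how the CLIP text encoder is used here: it is only queried for per-token features to condition the UNet, never finetuned.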

Thank you for your help.

Best Wishes,

Zongze

zyx1213271098 commented 8 months ago

Hi, I'm planning to replace the CLIP text encoder with a T5 model. However, T5 has an encoder-decoder structure. Which layer of T5 should I use as the feature output for the text tokens? Thanks.

YZBPXX commented 6 months ago

> Hi, I'm planning to replace the CLIP text encoder with a T5 model. However, T5 has an encoder-decoder structure. Which layer of T5 should I use as the feature output for the text tokens? Thanks.

Hello, have you tried it? How did it work?
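For anyone with the same question: the usual choice (and what the Imagen paper describes) is to drop the T5 decoder entirely and take the final hidden states of the encoder stack as the per-token conditioning features. A minimal sketch, assuming `transformers`' `T5EncoderModel` and an illustrative checkpoint name:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

name = "google/t5-v1_1-base"  # placeholder checkpoint; Imagen uses T5-XXL
tokenizer = AutoTokenizer.from_pretrained(name)
model = T5EncoderModel.from_pretrained(name).eval()

with torch.no_grad():
    batch = tokenizer(["a corgi surfing a wave"], return_tensors="pt")
    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])

# Final hidden states of the encoder stack, shape (batch, seq_len, d_model).
# These per-token vectors take the role of CLIP's last_hidden_state as the
# cross-attention context for the UNet; the decoder is not used at all.
print(out.last_hidden_state.shape)
```

Intermediate encoder layers are also possible, but the final layer's `last_hidden_state` is the common choice; note that `d_model` depends on the T5 size, so the UNet's cross-attention `context_dim` has to match it.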