Zeqiang-Lai closed 7 months ago
We were interested in getting better text understanding in the diffusion model, but the CLIP text encoder is pretty bad :) So we chose Flan-UL2, as it's the best open-source transformer encoder, and trained the model without CLIP (this was partially inspired by the Imagen work), and it works.
Ok, I see. Thanks for the response :)
Thanks for sharing this great work with the open-source community.
I am a little curious about the design choices in Kandinsky 3. Could you explain why you gave up the image prior used in Kandinsky 2.1 and Kandinsky 2.2?
Does it mean that the image prior is kind of useless?