the result is not good when combining llama2 and sd

ShihaoZhaoZSH / LaVi-Bridge

[ECCV 2024] Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation

MIT License


the result is not good when combining llama2 and sd #5

Closed. xiaoyingzhengrui closed this issue 6 months ago

xiaoyingzhengrui commented 6 months ago

Thanks for sharing this great project. I want to generate images using a language model. I downloaded llama2-7b and the adapter as provided, but the results I get are not as good as those shown in the paper. What precision do you use for the llama2-7b and sd-v1.4 models? I can see that the code runs llama2-7b in fp16, but I am not sure about the VAE and U-Net. I tested both in fp32 and fp16, and the style and details of the generated images look quite strange.

ShihaoZhaoZSH commented 6 months ago

Thank you for your interest in LaVi-Bridge! In our experiments, we use fp16 to obtain the text embedding and then convert it to fp32 for image generation, as shown in the script. We have made some updates, so please update the repository and give it another try.
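
For readers who want to check the dtype flow described here, the following is a minimal sketch of the pattern, not the repository's actual script. The modules `text_encoder` and `unet_proj` are hypothetical stand-ins (a small embedding table and a linear layer) so the example runs on CPU without downloading llama2-7b or sd-v1.4; only the fp16-to-fp32 cast is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins (assumptions, not the LaVi-Bridge code):
# a real run would load llama2-7b in fp16 and the sd-v1.4 U-Net in fp32.
text_encoder = nn.Embedding(num_embeddings=100, embedding_dim=16).half()  # fp16 weights
unet_proj = nn.Linear(16, 8)  # fp32 weights, stands in for the U-Net's text-conditioning input

token_ids = torch.tensor([[1, 5, 42, 7]])  # dummy token ids, batch of 1

with torch.no_grad():
    text_emb_fp16 = text_encoder(token_ids)  # text embedding computed in fp16
    text_emb_fp32 = text_emb_fp16.float()    # upcast to fp32 before image generation
    cond = unet_proj(text_emb_fp32)          # fp32 module receives fp32 input

print(text_emb_fp16.dtype, text_emb_fp32.dtype, cond.dtype)
```

If the cast is skipped, passing an fp16 tensor into an fp32 layer typically fails with a dtype mismatch error in PyTorch, so checking `.dtype` at the boundary between the language model and the U-Net is a quick way to confirm a setup matches the one described above.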