Open ProkopHapala opened 2 years ago
You should be interpolating the text embeddings, not the latent-space versions of the images. The latent representation of an image is itself a small image (size 64 x 64 x 4), so blending two latents mostly blends pixels. The text embedding space is 77 x 768 and encodes text semantics instead of pixels.
Most people recommend using spherical interpolation (slerp), but just regular linear interpolation seems to give fairly reasonable images.
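For reference, here is a minimal sketch of the two interpolation schemes mentioned above, written with plain NumPy. The names `lerp`, `slerp`, `emb_a`, and `emb_b` are made up for illustration, and the random arrays only stand in for real 77 x 768 text embeddings. This is not code from the Colab notebook, and how you feed the result back into your pipeline depends on your setup.

```python
import numpy as np

def lerp(a, b, t):
    """Plain linear interpolation between two arrays of the same shape."""
    return (1.0 - t) * a + t * b

def slerp(a, b, t, eps=1e-7):
    """Spherical interpolation, treating each array as one flattened vector."""
    a_flat, b_flat = a.ravel(), b.ravel()
    # Angle between the two vectors, computed from the normalized dot product.
    cos_omega = np.dot(a_flat, b_flat) / (
        np.linalg.norm(a_flat) * np.linalg.norm(b_flat)
    )
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    # Nearly parallel (or anti-parallel) vectors: fall back to lerp
    # to avoid dividing by a sine that is close to zero.
    if np.sin(omega) < eps:
        return lerp(a, b, t)
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Placeholder data with the text-embedding shape (an assumption for the demo;
# in practice these would come from the text encoder for two prompts).
rng = np.random.default_rng(0)
emb_a = rng.standard_normal((77, 768))
emb_b = rng.standard_normal((77, 768))

# Five interpolation steps from emb_a to emb_b.
frames = [slerp(emb_a, emb_b, t) for t in np.linspace(0.0, 1.0, 5)]
```

The practical difference is that lerp shrinks the norm of the intermediate vectors when the endpoints point in different directions, while slerp roughly preserves it. That may matter more for Gaussian noise latents than for text embeddings, which would fit the observation that lerp already looks reasonable here.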
In this paper (Fig. 6) https://arxiv.org/pdf/2010.02502.pdf they show that it is possible to interpolate images semantically in latent space.
I tried it with the Colab version of Stable Diffusion here https://colab.research.google.com/drive/11xRHNFskeBse0J4m5U3-FhUyw4c1mNch?usp=sharing
Simple code looks like this:
Rather than semantic interpolation, it seems to do just a simple interpolation in image space, like this: https://ibb.co/n7RrBrn
Why? What am I doing wrong? Is it somehow possible to achieve semantic interpolation as described in the paper (Fig. 6) https://arxiv.org/pdf/2010.02502.pdf
================= For completeness, here are the functions ===============