autonomousvision / stylegan-xl

[SIGGRAPH'22] StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets
MIT License

Why CLIP produces bad results #50

Closed: eliohead closed this issue 2 years ago

eliohead commented 2 years ago

To my knowledge, CLIP is used to find the image in the network's latent space that most closely matches the text prompt. So I cannot understand why, when I type "a yellow tiger" into StyleGAN-XL + CLIP, this is what Colab gives me back: [generated image] Doesn't ImageNet have a class for tigers? Shouldn't it generate something better? Even when I write "a cat", it gives me this result back, and it certainly doesn't look like a cat: [generated image] Am I missing something, or is there an explanation for why these results are worse than the ones obtained by sampling?
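For context, CLIP-guided generation along the lines described above typically scores a candidate image by the cosine similarity between its CLIP embedding and the prompt's CLIP embedding, then nudges the latent to raise that score. A minimal sketch of the scoring step, using small hand-made vectors as stand-ins for real CLIP embeddings (the vectors and the `embed_text` comment are illustrative, not the notebook's actual code):

```python
import math

def cosine_similarity(a, b):
    """CLIP-style score: cosine between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-ins for CLIP embeddings (hypothetical values for illustration):
text_emb = [0.2, 0.9, 0.4]     # e.g. embedding of "a yellow tiger"
img_good = [0.25, 0.85, 0.45]  # image that matches the prompt
img_bad  = [0.9, -0.1, 0.2]    # off-prompt image

# A latent-space search would minimise loss = 1 - cosine_similarity,
# pushing the generated image's embedding toward the text embedding.
print(cosine_similarity(text_emb, img_good))  # near 1.0
print(cosine_similarity(text_emb, img_bad))   # much lower
```

The key point is that this loss is only a proxy: a latent can score reasonably under CLIP while still looking wrong to a human, which is part of why such searches need careful tuning.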

xl-sr commented 2 years ago

This is mostly due to initialization and optimization. The network can definitely produce realistic-looking tigers, but the notebook is not tuned for this.
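To illustrate why initialization matters so much here: the CLIP loss over the latent space is non-convex, so plain gradient descent lands in whichever basin the starting latent sits in. A toy 1-D sketch (the loss function is invented for illustration, not StyleGAN-XL's actual objective) shows a random initialization getting stuck in a shallow local minimum, while an initialization near the target region reaches the deep one:

```python
import math

def loss(w):
    """Toy non-convex 'CLIP loss' over a 1-D latent: a shallow local
    minimum near w = -2 and the true (deep) minimum near w = +2."""
    return -0.4 * math.exp(-(w + 2) ** 2) - 1.0 * math.exp(-(w - 2) ** 2)

def grad(w):
    # Analytic derivative of loss(w).
    return (0.8 * (w + 2) * math.exp(-(w + 2) ** 2)
            + 2.0 * (w - 2) * math.exp(-(w - 2) ** 2))

def optimise(w, lr=0.1, steps=500):
    # Plain gradient descent from the given starting latent.
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_random = optimise(-2.5)  # poor init: converges to the shallow well
w_class  = optimise(1.5)   # init near the target: finds the deep well
print(loss(w_random), loss(w_class))
```

In the real setting the analogous trick is to initialize from latents whose class embedding already resembles the prompt (e.g. the tiger class for "a yellow tiger") rather than from a random latent, which is the kind of tuning the notebook currently lacks.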