To my knowledge, CLIP is used to find the image in the latent space of the neural network that most closely matches the one described in the prompt. So I don't understand why, when I type "a yellow tiger" into StyleGAN-XL + CLIP, this is what Colab gives me back:
Doesn't ImageNet have a class for tigers? Shouldn't it generate a better one? Even when I write "a cat", it gives me back this result.
It sure doesn't look like a cat.
Am I missing something, or is there an explanation for these results being worse than the ones obtained by sampling?
This is mostly due to initialization and optimization. These notebooks don't look up a matching image; they iteratively optimize a latent vector by gradient descent to raise the CLIP similarity between the generated image and the prompt. The network can definitely produce realistic-looking tigers, but the notebook is not tuned for this: a poor starting latent, too few steps, or a badly chosen learning rate will leave you with images far worse than plain samples from the generator.
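To make the role of initialization concrete, here is a dependency-free sketch (not the notebook's actual code): the generator and CLIP score are replaced by a simple squared distance between a "latent" and a fixed "prompt embedding", and we run the same number of gradient steps from a near and a far starting point. All names (`optimize`, `target`, the init vectors) are illustrative.

```python
# Toy stand-in for CLIP-guided latent optimization: minimize the squared
# distance between a latent z and a fixed "prompt embedding" target.
def loss(z, target):
    return sum((a - b) ** 2 for a, b in zip(z, target))

def grad(z, target):
    # Gradient of the squared distance with respect to z.
    return [2 * (a - b) for a, b in zip(z, target)]

def optimize(z, target, lr=0.1, steps=20):
    # Plain gradient descent, like the notebook's latent-optimization loop.
    for _ in range(steps):
        g = grad(z, target)
        z = [a - lr * b for a, b in zip(z, g)]
    return z

target = [1.0, -2.0, 0.5]        # stands in for the prompt's CLIP embedding
good_init = [0.9, -1.8, 0.4]     # initialization near a matching region
bad_init = [50.0, 30.0, -40.0]   # initialization far away

z_good = optimize(good_init, target)
z_bad = optimize(bad_init, target)
# With the same step budget, the far initialization ends with a much
# larger loss -- the same reason a badly initialized latent yields a
# worse image than a random sample from the generator.
print(loss(z_good, target) < loss(z_bad, target))  # → True
```

In the real notebook the loss landscape is far less benign than this quadratic bowl, so a bad start can also get stuck in a poor local optimum rather than merely converging slowly.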