To my knowledge, CLIP is used to find the image in the latent space of the neural network that most closely matches the one described in the prompt. So I don't understand why, when I type "a yellow tiger" into StyleGAN-XL + CLIP, this is what Colab gives me back:
Doesn't ImageNet have a class for tigers? Shouldn't it generate a better one? Even when I write "a cat", it gives me back this result.
It sure doesn't look like a cat.
Am I missing something, or is there an explanation for these results being worse than the ones obtained by sampling?
This is mostly due to initialization and optimization. These notebooks don't look up a matching image; they iteratively optimize a latent vector by gradient descent to raise the CLIP similarity between the generated image and the prompt. The network can definitely produce realistic-looking tigers, but the notebook is not tuned for this: a poor starting latent, too few steps, or a badly chosen learning rate will leave you with images far worse than plain samples from the generator.
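To make the role of initialization concrete, here is a dependency-free sketch (not the notebook's actual code): the generator and CLIP score are replaced by a simple squared distance between a "latent" and a fixed "prompt embedding", and we run the same number of gradient steps from a near and a far starting point. All names (`optimize`, `target`, the init vectors) are illustrative.

```python
# Toy stand-in for CLIP-guided latent optimization: minimize the squared
# distance between a latent z and a fixed "prompt embedding" target.
def loss(z, target):
    return sum((a - b) ** 2 for a, b in zip(z, target))

def grad(z, target):
    # Gradient of the squared distance with respect to z.
    return [2 * (a - b) for a, b in zip(z, target)]

def optimize(z, target, lr=0.1, steps=20):
    # Plain gradient descent, like the notebook's latent-optimization loop.
    for _ in range(steps):
        g = grad(z, target)
        z = [a - lr * b for a, b in zip(z, g)]
    return z

target = [1.0, -2.0, 0.5]        # stands in for the prompt's CLIP embedding
good_init = [0.9, -1.8, 0.4]     # initialization near a matching region
bad_init = [50.0, 30.0, -40.0]   # initialization far away

z_good = optimize(good_init, target)
z_bad = optimize(bad_init, target)
# With the same step budget, the far initialization ends with a much
# larger loss -- the same reason a badly initialized latent yields a
# worse image than a random sample from the generator.
print(loss(z_good, target) < loss(z_bad, target))  # → True
```

In the real notebook the loss landscape is far less benign than this quadratic bowl, so a bad start can also get stuck in a poor local optimum rather than merely converging slowly.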