Closed · tsWen0309 closed this issue 1 year ago
Hello Flu0XeT1n,
Thanks for following up again. Textual Inversion generates new token embeddings in the CLIP text embedding space. These lines from the textual inversion script we provide show which embeddings are used; more details can be found in the original paper by Gal et al. (2022) if you're interested. The line you marked with a TODO in the image you shared corresponds to this line in the training script.
No additional mapping on top of the new token embeddings is needed for Textual Inversion.
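For illustration, here is a minimal sketch of that idea (assuming the Hugging Face `transformers` CLIP classes; the file name `learned_embedding.pt` and the token `<my-concept>` are placeholders, not part of this repo): the learned vector is copied straight into a new row of the text encoder's embedding table, with no extra projection.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_name)
text_encoder = CLIPTextModel.from_pretrained(model_name)

# A Textual Inversion embedding is a single vector already living in the
# CLIP text embedding space (768-dim for ViT-L/14).
learned_embedding = torch.load("learned_embedding.pt")  # placeholder path, shape: (768,)

# Register the placeholder token and grow the embedding table by one row.
placeholder_token = "<my-concept>"  # placeholder name for illustration
tokenizer.add_tokens(placeholder_token)
text_encoder.resize_token_embeddings(len(tokenizer))

# Copy the learned vector directly into the new row -- no MLP or other
# transform is applied, because the vector was optimized in this space.
token_id = tokenizer.convert_tokens_to_ids(placeholder_token)
with torch.no_grad():
    text_encoder.get_input_embeddings().weight[token_id] = learned_embedding
```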
-Brandon
Hi, it's me again. When I went through your code, I found that in your implementation you attach the pre-trained Textual Inversion weights directly to the CLIP model. Why? Intuitively, the weights from Textual Inversion and those from CLIP should not be in the same space. Shouldn't there be an MLP to transform these weights into the same space? Forgive my naivety, I am still new to this area.