brandontrabucco / da-fusion

Effective Data Augmentation With Diffusion Models
MIT License

Fix Textual Inversion pre-trained weights to CLIP model #10

Closed · tsWen0309 closed 1 year ago

tsWen0309 commented 1 year ago

Hi, it's me again. When I go through your code, I see that you directly load the pre-trained Textual Inversion weights into the CLIP model in your implementation. Why? Intuitively, the weights from Textual Inversion and those from CLIP should not be in the same space. Shouldn't there be an MLP to transform these weights into the same space? Forgive my foolishness, I am still new to this area.

[image: screenshot of the embedding-loading code, with one line marked TODO]

brandontrabucco commented 1 year ago

Hello Flu0XeT1n,

Thanks for following up again. Textual Inversion generates new token embeddings directly in the CLIP text embedding space. These lines from the textual inversion script we provide show which embeddings are used; more details can be found in the original paper by Gal et al. (2022) if you're interested. The line you marked with a TODO in the image you shared corresponds to this line in the training script.

No additional maps on top of the new tokens are needed for Textual Inversion.
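
For anyone landing on this thread later, here is a minimal sketch of the loading step being discussed, assuming the common diffusers-style `learned_embeds.bin` format and a hypothetical `<new-concept>` placeholder token (an illustration, not the exact code from this repo):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Load the frozen CLIP text encoder used by Stable Diffusion (ViT-L/14).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the placeholder token and grow the embedding table by one row.
placeholder = "<new-concept>"  # hypothetical token name, for illustration
tokenizer.add_tokens(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))

# Load the learned vector (diffusers-style format: {token: 768-dim tensor})
# and copy it into the row for the new token. No MLP or projection is
# applied: the vector was optimized directly in CLIP's token embedding space.
learned = torch.load("learned_embeds.bin")
token_id = tokenizer.convert_tokens_to_ids(placeholder)
with torch.no_grad():
    text_encoder.get_input_embeddings().weight[token_id] = learned[placeholder]
```

Since Textual Inversion optimizes the new vector against the frozen text encoder, it already lives in the same space as CLIP's existing token embeddings, which is why a plain copy suffices.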

-Brandon