bo-miao closed this 1 month ago
Word-level textual features can be obtained with encode_text_full, which is the function used for SD models. We didn't use pixel-level spatial features. If you want them, you can comment out the last three lines of the ViT.forward function.
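To make the difference between pooled and per-token features concrete, here is a minimal toy sketch. It is not code from this repository. It assumes the conventions of the original OpenAI CLIP code: the sentence feature is read at the end-of-text (EOT) position, which is the argmax of the token ids, and the ViT output has the class token at index 0 followed by the patch tokens. Plain nested lists stand in for tensors so the sketch runs without dependencies, and the helper names are hypothetical. Trimming the padding positions is a choice made for this example, not a claim about what encode_text_full returns.

```python
# Toy sketch, NOT the repository's implementation. Nested lists stand in for
# tensors. Conventions (EOT = largest token id, class token at index 0) are
# assumed from the original OpenAI CLIP code. All helper names are hypothetical.

def eot_position(token_ids):
    """Index of the end-of-text token, i.e. the argmax over the token ids."""
    return max(range(len(token_ids)), key=token_ids.__getitem__)


def pooled_text_feature(token_features, token_ids):
    """Sentence-level feature: the single row at the EOT position."""
    return token_features[eot_position(token_ids)]


def word_level_text_features(token_features, token_ids):
    """Per-token features: every row up to and including EOT (padding dropped)."""
    return token_features[: eot_position(token_ids) + 1]


def split_vit_tokens(tokens, grid_h, grid_w):
    """Split ViT output tokens into the class token and a grid of patch features."""
    cls_token, patches = tokens[0], tokens[1:]
    if len(patches) != grid_h * grid_w:
        raise ValueError("number of patch tokens does not match the grid size")
    # Row-major reshape: row r holds patches r*grid_w .. (r+1)*grid_w - 1.
    grid = [patches[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
    return cls_token, grid


# Six token positions with 2-dim features; ids mimic [SOT, word, word, EOT, pad, pad].
text_feats = [[float(i), float(i) + 0.5] for i in range(6)]
text_ids = [49406, 320, 1125, 49407, 0, 0]
sentence_feat = pooled_text_feature(text_feats, text_ids)
word_feats = word_level_text_features(text_feats, text_ids)

# One class token followed by 2 x 3 patch tokens with 1-dim features.
vit_tokens = [[float(i)] for i in range(7)]
cls_feat, patch_grid = split_vit_tokens(vit_tokens, 2, 3)
```

In the real model the same idea applies to tensors: the pooled path keeps one vector per sentence or image, while the per-token path keeps the whole sequence, which is what you get from the ViT once the final class-token selection is skipped.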
Thank you for your answer!
Hello, may I ask if you have a method to directly load word features from a model with trained weights?
Sorry, I don't understand your question. Does 'the model with trained weights' mean the pre-trained Long-CLIP model? Does 'word feature' mean the embedding of each word in a sentence?
Hi,
Thanks for this interesting work! Could I know how to extract word-level textual features and pixel-level spatial features?