beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
Apache License 2.0

how to extract word and spatial feature #57

Closed bo-miao closed 1 month ago

bo-miao commented 1 month ago

Hi,

Thanks for this interesting work! Could you tell me how to extract word-level textual features and pixel-level spatial features?

beichenzbc commented 1 month ago

We didn't leverage word-level textual features. They can be obtained by encode_text_full, which is used in SD models. We didn't use pixel-level spatial features either. If you want them, you may comment out the last three lines in the ViT.forward function.
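For readers unsure what separates the pooled outputs from the word-level and pixel-level ones, here is a toy sketch. It is not Long-CLIP code. It assumes the usual CLIP layout: the text encoder pools by picking the vector at the end-of-text token (the largest token id), and the ViT puts a class token first, followed by one token per image patch. The function names, the token ids, and the choice to drop padding positions are all illustrative assumptions, and nested lists stand in for tensors.

```python
# Toy illustration only: nested lists stand in for tensors and all numbers are made up.
# Assumes a CLIP-style layout; check the Long-CLIP source before relying on it.

def split_text_features(token_ids, hidden):
    """Split one sentence's hidden states into (pooled, per_token).

    hidden[i] is the feature vector at token position i. The pooled feature
    is the vector at the end-of-text token, assumed to have the largest id.
    """
    eot_pos = max(range(len(token_ids)), key=lambda i: token_ids[i])
    # Keeping positions up to and including end-of-text drops the padding;
    # this trimming is a choice made for this sketch.
    return hidden[eot_pos], hidden[: eot_pos + 1]


def split_image_features(tokens, grid_h, grid_w):
    """Split ViT tokens into (cls_vector, patch_grid).

    tokens[0] is the class token; the remaining grid_h * grid_w tokens are
    patch features in row-major order, so patch_grid[r][c] is one patch.
    """
    cls_vec, patches = tokens[0], tokens[1:]
    if len(patches) != grid_h * grid_w:
        raise ValueError("patch count does not match the grid size")
    grid = [patches[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
    return cls_vec, grid


# Hypothetical ids: start token, two word tokens, end-of-text, padding.
ids = [49406, 320, 1125, 49407, 0, 0]
states = [[float(i), float(i) + 0.5] for i in range(len(ids))]
pooled, per_token = split_text_features(ids, states)

vit_tokens = [[float(i)] for i in range(1 + 2 * 3)]
cls_vec, patch_grid = split_image_features(vit_tokens, grid_h=2, grid_w=3)
```

In this picture, the default encoders return only `pooled` and `cls_vec`. The per-token list and the patch grid are the intermediate outputs the question asks about.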

bo-miao commented 1 month ago

Thank you for your answer!

liuwanqingqing commented 1 month ago

Thank you for your answer!

Hello, may I ask if you have a method to directly load word features from a model with trained weights?

beichenzbc commented 4 weeks ago

Sorry, I don't understand your question. Does 'the model with trained weights' mean the pre-trained Long-CLIP model? Does 'word feature' mean the embedding of each word in a sentence?