PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License
723 stars 52 forks

Which outputs should be used for feature extraction and alignment? #20

Closed. xiaohaochen0308 closed this issue 11 months ago.

xiaohaochen0308 commented 11 months ago

```python
import torch
from languagebind import LanguageBindImage, LanguageBindImageTokenizer, LanguageBindImageProcessor

# Load the pretrained image-language model, its tokenizer, and the processor.
pretrained_ckpt = 'LanguageBind/LanguageBind_Image'
model = LanguageBindImage.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
image_process = LanguageBindImageProcessor(model.config, tokenizer)

model.eval()
# Preprocess one image and one text, then run a forward pass without gradients.
data = image_process([r"your/image.jpg"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

# Text-image similarity matrix between the aligned embeddings.
print(out.text_embeds @ out.image_embeds.T)
```
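Assuming the returned embeddings are L2-normalized, as in CLIP-style models, the matrix product above is a cosine-similarity matrix. A small follow-up sketch (continuing from the snippet above) turns it into retrieval probabilities; the scale factor of 100 mirrors CLIP's logit-scale convention and is an assumption here, not something the repo prescribes:

```python
# Continues from the snippet above.
# out.text_embeds: (num_texts, dim); out.image_embeds: (num_images, dim).
sim = out.text_embeds @ out.image_embeds.T   # cosine similarity if embeddings are L2-normalized
probs = (sim * 100).softmax(dim=-1)          # assumed CLIP-style temperature of 100
print(probs)                                 # per-text probabilities over the images
```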

Hello, if I load the LanguageBind_Image model to extract and align image and text features, should I use out.text_embeds and out.image_embeds for downstream work, such as subsequent fusion and classification?

LinB203 commented 11 months ago

Yes.
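For the downstream fusion-and-classification step mentioned in the question, here is a minimal sketch of one possible approach: concatenate the two embeddings and feed them to a small classification head. The `FusionClassifier` module, its hidden size, and `num_classes` are illustrative assumptions and are not part of LanguageBind.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Hypothetical fusion head: concatenate the aligned text and image
    embeddings, then classify. Not part of the LanguageBind repo."""
    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim * 2, embed_dim),  # fuse concatenated features
            nn.ReLU(),
            nn.Linear(embed_dim, num_classes),    # class logits
        )

    def forward(self, text_embeds, image_embeds):
        fused = torch.cat([text_embeds, image_embeds], dim=-1)
        return self.head(fused)

# Usage with the outputs from the snippet above:
# clf = FusionClassifier(embed_dim=out.text_embeds.shape[-1], num_classes=10)
# logits = clf(out.text_embeds, out.image_embeds)
```

Concatenation is just one option; element-wise product or attention-based fusion are common alternatives depending on the task.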