OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

The CN-CLIP ViT-B/16 model still returns matches for input unrelated to the embedded images, e.g. !@#¥ #330

Open · wangyan828 opened 1 month ago

wangyan828 commented 1 month ago

1. I used the CN-CLIP ViT-B/16 model to embed several images of dogs and added the embeddings to a Chroma vector database. When I then query with the special characters !@#¥, the dog images are still returned as matches. Why does input that has nothing to do with the embedded images return results?

2. Moreover, among the results returned for the query "狗" (dog), several images get a lower similarity score than they do for the query !@#¥. Why can an input unrelated to the embedded images score higher than a related one?

3. My code is as follows:

```python
import onnxruntime
from PIL import Image
import numpy as np
import torch
import argparse
import cn_clip.clip as clip
from cn_clip.clip import load_from_name, available_models
from cn_clip.clip.utils import _MODELS, _MODEL_INFO, _download, available_models, create_model, image_transform
import chromadb
import uuid

if __name__ == "__main__":
    img_sess_options = onnxruntime.SessionOptions()
    img_run_options = onnxruntime.RunOptions()
    img_run_options.log_severity_level = 2

    # The fp16 model produces warnings, so use the fp32 model here
    img_onnx_model_path = "/usr/share/kylin-datamanagement-models/cn-clip-onnx/vit-b-16.img.fp32.onnx"
    img_session = onnxruntime.InferenceSession(img_onnx_model_path,
                                               sess_options=img_sess_options,
                                               providers=["CPUExecutionProvider"])
    model_arch = "ViT-B-16"
    preprocess = image_transform(_MODEL_INFO[model_arch]['input_resolution'])
    image_path = "/home/wangyan/wangyan/ziliao/test-search/test/dog.jpeg"
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    # print("get image shape of:", image.shape)

    # Compute the image-side features with the ONNX model
    image_features = img_session.run(["unnorm_image_features"], {"image": image.cpu().numpy()})[0]  # unnormalized image features
    image_features = torch.tensor(image_features)
    # print(image_features.norm(dim=-1, keepdim=True))
    image_features /= image_features.norm(dim=-1, keepdim=True)  # normalized Chinese-CLIP image features for downstream use

    # Create the database and add the data
    embedded_as_lists = []
    for array in image_features:
        embedded_list = [float(elem) for elem in array.flatten()]
        embedded_as_lists.append(embedded_list)
    chroma_client = chromadb.PersistentClient(path="/home/wangyan/文档/database")
    collection = chroma_client.get_or_create_collection(name="usermanual", metadata={"hnsw:space": "cosine"})
    # uuids = [str(uuid.uuid4()) for _ in embedded_as_lists]
    # data = collection.add(
    #     ids=uuids,
    #     embeddings=embedded_as_lists
    # )

    # Load the ONNX text-side model (**replace ${DATAPATH} with the actual path**)
    txt_sess_options = onnxruntime.SessionOptions()
    txt_run_options = onnxruntime.RunOptions()
    txt_run_options.log_severity_level = 2
    txt_onnx_model_path = "/usr/share/kylin-datamanagement-models/cn-clip-onnx/vit-b-16.txt.fp32.onnx"
    txt_session = onnxruntime.InferenceSession(txt_onnx_model_path,
                                               sess_options=txt_sess_options,
                                               providers=["CPUExecutionProvider"])

    # Tokenize the input text. The sequence length is set to 52 and must match the
    # context-length used when the ONNX model was converted.
    text = clip.tokenize(["!@#¥"], context_length=52)
    print("tokens:", text)
    text_features = []
    for i in range(len(text)):
        one_text = np.expand_dims(text[i].cpu().numpy(), axis=0)
        text_feature = txt_session.run(["unnorm_text_features"], {"text": one_text})[0]  # unnormalized text features
        # print(text_feature)
        text_feature = torch.tensor(text_feature)
        text_features.append(text_feature)
    text_features = torch.squeeze(torch.stack(text_features), dim=1)  # stack the per-text feature vectors
    text_features = text_features / text_features.norm(dim=1, keepdim=True)  # normalized Chinese-CLIP text features for downstream use

    # Query the collection and convert cosine distances back to similarities
    embedded_as = []
    for array in text_features:
        embedded_list = [float(elem) for elem in array.flatten()]
        embedded_as.append(embedded_list)
    data = collection.query(
        query_embeddings=embedded_as,
        n_results=10
    )
    distances = data.get('distances')[0]
    results = [1 - d if isinstance(d, (int, float)) else None for d in distances]
    print(results)
```
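For context on question 1: `collection.query(..., n_results=10)` is a nearest-neighbour search, so it always returns the 10 closest stored vectors no matter how far away they are, and CLIP-style embeddings typically give every text query some non-trivial cosine similarity to every image. The usual mitigation is to filter the returned distances with a similarity threshold. Below is a minimal, self-contained sketch with random placeholder vectors standing in for CLIP embeddings; the collection name `demo` and the threshold value 0.25 are arbitrary assumptions, not recommendations.

```python
import numpy as np
import chromadb

# Toy setup: random unit vectors stand in for Chinese-CLIP image embeddings.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 512))
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)

client = chromadb.Client()  # in-memory client, just for the demo
col = client.get_or_create_collection(name="demo", metadata={"hnsw:space": "cosine"})
col.add(ids=[f"img{i}" for i in range(5)], embeddings=image_embs.tolist())

# Any query vector, even a "meaningless" one, still has nearest neighbours.
query = rng.normal(size=(1, 512))
query /= np.linalg.norm(query, axis=1, keepdims=True)
res = col.query(query_embeddings=query.tolist(), n_results=5)

# Chroma returns cosine *distance* (1 - cosine similarity); convert and filter.
SIM_THRESHOLD = 0.25  # placeholder value, must be tuned on your own data
sims = [1.0 - d for d in res["distances"][0]]
kept = [(i, s) for i, s in zip(res["ids"][0], sims) if s >= SIM_THRESHOLD]
print("all similarities:", sims)
print("kept after threshold:", kept)
```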
ersanliqiao commented 1 month ago

This is actually a hard problem: if the similarity threshold is set too high, semantically relevant text fails to retrieve the images; if it is set too low, simple meaningless text retrieves images as well.
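One workaround sometimes used instead of a single absolute threshold is a relative test: score the query against each image alongside a set of generic "null" prompts, and treat it as a match only if it beats the best null prompt by some margin. The sketch below is illustrative only; the helper name `is_relevant`, the null prompts, and the margin value are assumptions, not part of Chinese-CLIP, and the placeholder vectors would be replaced by real text/image embeddings.

```python
import numpy as np

def is_relevant(query_emb, image_emb, null_embs, margin=0.02):
    """Match only if the query beats the best 'null' prompt by a margin.

    All embeddings are assumed L2-normalized; margin is a tunable placeholder.
    """
    sim_query = float(query_emb @ image_emb)
    sim_null = float(np.max(null_embs @ image_emb))
    return sim_query - sim_null >= margin

def unit(v):
    return v / np.linalg.norm(v)

# Placeholder unit vectors standing in for CLIP embeddings.
rng = np.random.default_rng(1)
image_emb = unit(rng.normal(size=512))
query_emb = unit(rng.normal(size=512))  # e.g. the embedding of "!@#¥"
null_embs = np.stack([unit(rng.normal(size=512)) for _ in range(3)])  # e.g. embeddings of generic filler prompts

print(is_relevant(query_emb, image_emb, null_embs))
```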