OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
https://internvl.readthedocs.io/en/latest/
MIT License
6.09k stars 474 forks

Why is the similarity comparison result of InternViT-300M-448px for images very high #609

Open eaeful opened 1 month ago

eaeful commented 1 month ago

I want to use InternViT-300M-448px to compare two images, 1.jpg and 2.jpg, and output their similarity. The code below was written by an AI assistant at my request. Why is the output similarity always very high, around 0.9 or above? Whether I use two very similar images or two very different ones, the score stays above 0.9, so I cannot tell whether the images are actually similar. Other models I tried, such as siglip-so400m and clip ViT-L-14, output normal similarity scores. I am a beginner in AI, so please forgive any basic mistakes. This is the Hugging Face download link: https://huggingface.co/OpenGVLab/InternViT-300M-448px/

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from torch.nn.functional import cosine_similarity

# Select the device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the pretrained model and the image processor
model_path = 'E:/InternViT-300M-448px/'
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).to(device).eval()

image_processor = CLIPImageProcessor.from_pretrained(model_path)

# Define a function that preprocesses an image and returns its features
def get_image_features(image_path):
    image = Image.open(image_path).convert('RGB')
    pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
    pixel_values = pixel_values.to(torch.bfloat16).to(device)
    with torch.no_grad():
        outputs = model(pixel_values)
    # Assume pooler_output is the feature representation we want to use
    features = outputs.pooler_output
    return features

# Get the features of the two images
features1 = get_image_features('./1.jpg')
features2 = get_image_features('./2.jpg')

# Compute the cosine similarity
similarity = cosine_similarity(features1, features2).item()
print(f"The similarity between the two images is: {similarity:.4f}")

Output result:

Using device: cuda
The similarity between the two images is: 1.0078
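
A note on the printed value: cosine similarity is mathematically bounded by 1, so 1.0078 looks like a rounding artifact rather than a real score. 1.0078125 (that is, 1 + 2^-7) is the smallest bfloat16 value above 1.0, which suggests the excess comes from computing in bfloat16. This is an assumption and is not confirmed in this thread. The sketch below uses only the standard library; the helper names `to_bfloat16` and `cosine_similarity` are hypothetical and exist only for illustration. It emulates bfloat16 rounding for finite inputs and shows that a value slightly above 1 snaps to 1.0078125.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the top 16 bits of an IEEE float32, so we round the
    float32 bit pattern and zero out the low 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors, in double precision."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Near 1.0, bfloat16 values are spaced 2**-7 apart, so small errors
# either vanish or jump a full step.
print(to_bfloat16(1.003))   # rounds down to 1.0
print(to_bfloat16(1.0078))  # rounds to 1.0078125

# In double precision, parallel vectors give a score of (almost exactly) 1.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In the script above, casting the features with `features.float()` before calling `cosine_similarity` should remove this rounding effect. It would not by itself explain why the scores stay above 0.9 for dissimilar images.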
RuixiangZhao commented 1 week ago