beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
Apache License 2.0

[Question] How did you evaluate the effective length of CLIP. #71

Closed: FangGet closed this issue 1 month ago

FangGet commented 1 month ago

Hi, I have a question about your exploration of the effective length of CLIP. As described in Sec. 3.1 of the paper, you use the Urban dataset for the evaluation, but how exactly? The captions of the Urban dataset are on average 110 words, so how do you "incrementally increase the input caption length"? I first tested direct truncation, like caption = caption[:10-40], but the R@1 does not show the trend reported in the paper. I also tested sentence-level truncation, but the caption length is then at least 40-60 words, and R@1 keeps increasing as the length grows.

Can you give some suggestions about the evaluation process? Thanks.

beichenzbc commented 1 month ago

That's really strange; we also tested the effective length by direct truncation, and other work (https://arxiv.org/pdf/2408.01181) also supports our finding.

FangGet commented 1 month ago

Thanks, I rechecked my code and now operate directly on the tokenized sequence; it works now:

truncated_idx = 10  # swept from 10 to 70
text_feature = clip.tokenize(text_list, truncate=True).to(device)
text_feature[:, truncated_idx:] = 0  # zero out everything after the cut
text_feature[:, truncated_idx] = clip.clip._tokenizer.encoder["<|endoftext|>"]  # re-insert the EOT token
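
(For reference, wrapping this masking into a sweep over truncation lengths might look like the following sketch; it assumes text_list, image_embeds, model and device are already prepared as in the evaluation script posted later in this thread.)

import torch
import clip

eot_id = clip.clip._tokenizer.encoder["<|endoftext|>"]

with torch.no_grad():
    for truncated_idx in (10, 20, 30, 40, 50, 60, 70):
        tokens = clip.tokenize(text_list, truncate=True).to(device)
        tokens[:, truncated_idx:] = 0      # drop everything after the cut
        tokens[:, truncated_idx] = eot_id  # CLIP pools the text feature at the EOT position
        text_embeds = model.encode_text(tokens)
        text_embeds /= text_embeds.norm(dim=-1, keepdim=True)

        # Text-to-image R@1: the paired image should be the most similar one.
        sim = text_embeds @ image_embeds.T
        r1 = (sim.argmax(dim=-1) == torch.arange(len(text_list), device=device)).float().mean()
        print(truncated_idx, r1.item())
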
Here are the test results:

| truncated_idx | I2T R@1 | T2I R@1 |
| --- | --- | --- |
| 10 | 0.117 | 0.105 |
| 20 | 0.293 | 0.275 |
| 30 | 0.422 | 0.371 |
| 40 | 0.513 | 0.425 |
| 50 | 0.570 | 0.481 |
| 60 | 0.616 | 0.505 |
| 70 | 0.649 | 0.512 |
| ALL | 0.672 | 0.534 |

[attached image]

This is tested on Urban-1K, but it does not show the effectiveness of the first 20 tokens. Can you give me some suggestions? Thanks again.

beichenzbc commented 1 month ago

Hello, sorry for the late reply, I was busy with a TOEFL exam last week.

The original results are based on text-to-image retrieval on the Urban-200 dataset, an early version of the Urban dataset, so there may be some gaps in the numbers. We re-ran the experiment on Urban-1K; here are the results, which also show that the most effective length is around 20-30 tokens:

[attached image: re-run results on Urban-1K]

A possible reason for the gap is that we find CLIP is much more aware of text rendered in the image than of other visual details (e.g., typographic attacks), so if the description of the textual content in the image is around 30 tokens long, the effective length may appear longer.

Here's our code for evaluation:

import json
import cv2
from PIL import Image
import sys
sys.path.append('../..')
from model import longclip
import torch
import torch.utils.data as data
import os
import numpy as np
import clip

image_root = '/mnt/petrelfs/zhangbeichen/Urban1k/image/'
caption_root = '/mnt/petrelfs/zhangbeichen/Urban1k/caption/'

class local_dataset(data.Dataset):
    def __init__(self):
        self.image_root = image_root
        self.caption_root = caption_root
        self.total_image = os.listdir(image_root)
        self.total_caption = os.listdir(caption_root)
        #model, preprocess = longclip.load("../../checkpoints/longclip-B.pt", device='cuda')
    def __len__(self):
        return len(self.total_caption)

    def __getitem__(self, index):
        caption_name = self.total_caption[index]
        image_name = self.total_caption[index][:-4] + '.jpg'
        image = Image.open(self.image_root + image_name)
        with open(self.caption_root + caption_name) as f:
            caption = f.readlines()[0]
        # No-op as written; word-level truncation of the caption can be applied here.
        caption = " ".join(caption.split(" "))

        return image, caption

if __name__ == '__main__':
    dataset = local_dataset()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)
    model.eval()

    print("model done!")

    img_feature_list = []
    text_list = []

    with torch.no_grad():
        # Encode all captions in one batch (tokenize truncates to CLIP's 77-token limit).
        for i, (image, caption) in enumerate(dataset):
            text_list.append(caption)

        text_feature = clip.tokenize(text_list, truncate=True).to(device)
        text_feature = model.encode_text(text_feature)
        text_feature /= text_feature.norm(dim=-1, keepdim=True)

        # Encode the images one by one and stack the features.
        for i, (image, caption) in enumerate(dataset):
            image = preprocess(image).unsqueeze(0).to(device)
            img_feature = model.encode_image(image)
            img_feature_list.append(img_feature)

        image_embeds = torch.cat(img_feature_list, dim=0)
        image_embeds /= image_embeds.norm(dim=-1, keepdim=True)

        # Text-to-image R@1: for each caption, check whether the most similar image is its pair.
        print("text 2 image")
        correct = 0
        total = 0
        for i in range(text_feature.shape[0]):
            text = text_feature[i]
            sim = text @ image_embeds.T
            sim = sim.squeeze()
            correct_i = torch.argmax(sim)

            if i == correct_i:
                correct = correct + 1
            total = total + 1
        print(total)
        print(correct)
        print(correct / total)

        # Image-to-text R@1: for each image, check whether the most similar caption is its pair.
        print("image to text")
        correct = 0
        total = 0
        for i in range(image_embeds.shape[0]):
            img = image_embeds[i]
            sim = img @ text_feature.T
            sim = sim.squeeze()
            correct_i = torch.argmax(sim)

            if i == correct_i:
                correct = correct + 1
            total = total + 1
        print(total)
        print(correct)
        print(correct / total)
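
For reference, the two R@1 loops above can be computed equivalently with a single similarity matrix; this is only a compact sketch, not part of the original script, and assumes text_feature and image_embeds are the normalized (N, D) matrices computed above, in matching row order:

sim = text_feature @ image_embeds.T                           # (N, N) caption-image similarities
labels = torch.arange(sim.shape[0], device=sim.device)
t2i_r1 = (sim.argmax(dim=1) == labels).float().mean().item()  # text-to-image R@1
i2t_r1 = (sim.argmax(dim=0) == labels).float().mean().item()  # image-to-text R@1
print(f"T2I R@1: {t2i_r1:.3f}  I2T R@1: {i2t_r1:.3f}")
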
beichenzbc commented 1 month ago

Feel free to contact me if you have other problems

FangGet commented 1 month ago

Thanks, I get the same result for the full-length evaluation, but differences remain at the truncated lengths. How should I truncate the input to [10-70] tokens to get results similar to yours?

beichenzbc commented 1 month ago

Change this line, caption = " ".join(caption.split(" ")), to caption = " ".join(caption.split(" ")[:N]) with N = 10/20/30/..., i.e. truncate the caption at the word level before tokenization.
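
Written out, that change could also be made into a parameter, e.g. as a small wrapper around the dataset from the script above (a sketch; the truncated_dataset class and its max_words argument are hypothetical, not part of the original code):

class truncated_dataset(local_dataset):
    # Same dataset as above, but keep only the first max_words words of each caption.
    def __init__(self, max_words):
        super().__init__()
        self.max_words = max_words

    def __getitem__(self, index):
        image, caption = super().__getitem__(index)
        caption = " ".join(caption.split(" ")[:self.max_words])
        return image, caption

Re-running the script with, e.g., dataset = truncated_dataset(max_words=20) instead of dataset = local_dataset() then sweeps the effective caption length at the word level.
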

beichenzbc commented 1 month ago

There may be some gap due to package versions or GPUs; you can contact me if there's a large gap.

FangGet commented 1 month ago

OK, it was a tiny difference due to the package version. Now I understand the effective token length evaluation. Thanks.