That's really strange; we also tested the effective length by direct truncation, and other work (https://arxiv.org/pdf/2408.01181) also supports our finding.
Thanks, I rechecked my code. Operating directly on the tokenized input works now:

```python
truncated_idx = 10  # varied over 10, 20, ..., 70
text_feature = clip.tokenize(text_list, truncate=True).to(device)
text_feature[:, truncated_idx:] = 0
text_feature[:, truncated_idx] = clip.clip._tokenizer.encoder["<|endoftext|>"]
```
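(For reference, a minimal sketch of how such a sweep could be driven; `text_list`, `model`, `device`, and `image_embeds` are assumed from the evaluation script later in this thread, and `evaluate_retrieval` is a hypothetical helper, not an existing function.)

```python
# Sketch of the full sweep over cut points (10, 20, ..., 70 tokens).
# `text_list`, `model`, `device`, `image_embeds` come from the evaluation script below;
# `evaluate_retrieval` is a hypothetical helper returning (I2T R@1, T2I R@1).
eot_token = clip.clip._tokenizer.encoder["<|endoftext|>"]

for truncated_idx in range(10, 80, 10):
    tokens = clip.tokenize(text_list, truncate=True).to(device)
    tokens[:, truncated_idx:] = 0              # drop everything after the cut point
    tokens[:, truncated_idx] = eot_token       # re-insert the end-of-text token there
    with torch.no_grad():
        text_embeds = model.encode_text(tokens)
        text_embeds /= text_embeds.norm(dim=-1, keepdim=True)
    i2t_r1, t2i_r1 = evaluate_retrieval(text_embeds, image_embeds)
    print(truncated_idx, i2t_r1, t2i_r1)
```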
Here are the test results:

| truncated_idx | I2T R@1 | T2I R@1 |
|---|---|---|
| 10 | 0.117 | 0.105 |
| 20 | 0.293 | 0.275 |
| 30 | 0.422 | 0.371 |
| 40 | 0.513 | 0.425 |
| 50 | 0.570 | 0.481 |
| 60 | 0.616 | 0.505 |
| 70 | 0.649 | 0.512 |
| ALL | 0.672 | 0.534 |
This was tested on Urban-1K, but it does not show the claimed effectiveness of the first 20 tokens. Could you give me some suggestions? Thanks again.
Hello, sorry for the late reply; I was busy with the TOEFL exam last week.
The original results were based on text-to-image retrieval on Urban-200, an early version of the Urban dataset, so there may be some data gaps. We re-ran the experiment on Urban-1K, and the new results also show that the most effective length is around 20-30 tokens.
A possible reason for the gap is that we find CLIP is much more aware of text rendered in the image than of other details (e.g., it is susceptible to typographic attacks), so if the description of the textual content in the image falls around token 30, the effective length may appear longer.
Here's our code for evaluation:
```python
import json
import cv2
from PIL import Image
import sys
sys.path.append('../..')
from model import longclip
import torch
import torch.utils.data as data
import os
import numpy as np
import clip

image_root = '/mnt/petrelfs/zhangbeichen/Urban1k/image/'
caption_root = '/mnt/petrelfs/zhangbeichen/Urban1k/caption/'


class local_dataset(data.Dataset):
    def __init__(self):
        self.image_root = image_root
        self.caption_root = caption_root
        self.total_image = os.listdir(image_root)
        self.total_caption = os.listdir(caption_root)
        # model, preprocess = longclip.load("../../checkpoints/longclip-B.pt", device='cuda')

    def __len__(self):
        return len(self.total_caption)

    def __getitem__(self, index):
        # Each caption file <name>.txt is paired with the image <name>.jpg
        caption_name = self.total_caption[index]
        image_name = self.total_caption[index][:-4] + '.jpg'
        image = Image.open(self.image_root + image_name)
        f = open(self.caption_root + caption_name)
        caption = f.readlines()[0]
        caption = " ".join(caption.split(" "))
        return image, caption


if __name__ == '__main__':
    dataset = local_dataset()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)
    model.eval()
    print("model done!")

    img_feature_list = []
    text_list = []

    with torch.no_grad():
        # Encode all captions in one batch and L2-normalize
        for i, (image, caption) in enumerate(dataset):
            text_list.append(caption)
        text_feature = clip.tokenize(text_list, truncate=True).to(device)
        text_feature = model.encode_text(text_feature)
        text_feature /= text_feature.norm(dim=-1, keepdim=True)

        # Encode all images and L2-normalize
        for i, (image, caption) in enumerate(dataset):
            image = preprocess(image).unsqueeze(0).to(device)
            img_feature = model.encode_image(image)
            img_feature_list.append(img_feature)
        image_embeds = torch.cat(img_feature_list, dim=0)
        image_embeds /= image_embeds.norm(dim=-1, keepdim=True)

        # Text-to-image R@1
        print("text 2 image")
        correct = 0
        total = 0
        for i in range(text_feature.shape[0]):
            text = text_feature[i]
            sim = text @ image_embeds.T
            sim = sim.squeeze()
            correct_i = torch.argmax(sim)
            if i == correct_i:
                correct = correct + 1
            total = total + 1
        print(total)
        print(correct)
        print(correct / total)

        # Image-to-text R@1
        print("image to text")
        correct = 0
        total = 0
        for i in range(image_embeds.shape[0]):
            img = image_embeds[i]
            sim = img @ text_feature.T
            sim = sim.squeeze()
            correct_i = torch.argmax(sim)
            if i == correct_i:
                correct = correct + 1
            total = total + 1
        print(total)
        print(correct)
        print(correct / total)
```
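As a side note, the two retrieval loops at the end can also be written in vectorized form. A minimal equivalent sketch, assuming the same L2-normalized `text_feature` and `image_embeds` tensors and the one-caption-per-image pairing of Urban-1K (this could also serve as the body of a reusable evaluation helper):

```python
# Full similarity matrix: one row per caption, one column per image
sim = text_feature @ image_embeds.T                       # (N_text, N_image)
idx = torch.arange(sim.shape[0], device=sim.device)
t2i_r1 = (sim.argmax(dim=1) == idx).float().mean()        # text-to-image R@1
i2t_r1 = (sim.T.argmax(dim=1) == idx).float().mean()      # image-to-text R@1
print(t2i_r1.item(), i2t_r1.item())
```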
Feel free to contact me if you have other problems.
Thanks, I got the same result for the full-token-length evaluation, but differences remain at truncated lengths. How should I truncate the input tokens to 10-70 to get results similar to yours?
Change this line:

```python
caption = " ".join(caption.split(" "))
```

to:

```python
caption = " ".join(caption.split(" ")[:N])  # N = 10, 20, 30, ...
```
There may be some gap because of package versions or GPUs; you can contact me if there is a large gap.
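For concreteness, one way to parameterize that word-level truncation is a thin wrapper around `local_dataset`; the `max_words` argument is an illustrative addition, not part of the original script:

```python
class local_dataset_truncated(local_dataset):
    # Same dataset, but keeps only the first `max_words` words of each caption.
    def __init__(self, max_words):
        super().__init__()
        self.max_words = max_words

    def __getitem__(self, index):
        image, caption = super().__getitem__(index)
        caption = " ".join(caption.split(" ")[:self.max_words])
        return image, caption


# Sweep: re-run the retrieval evaluation above with 10, 20, ..., 70-word captions
# for max_words in range(10, 80, 10):
#     dataset = local_dataset_truncated(max_words)
#     ...
```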
OK, it's only a tiny difference due to package versions. Now I understand the effective token length. Thanks.
Hi, I have a question about your exploration of the effective length of CLIP. As described in Sec. 3.1 of the paper, you use the Urban dataset for the evaluation, but how exactly? The captions in the Urban dataset are 110 words on average, so how do you "incrementally increase the input caption length"? I first tested direct truncation like `caption = caption[:10-40]`, but R@1 did not follow the trend in the paper. I also tested sentence-level truncation, but the captions are then at least 40-60 words long, and R@1 keeps increasing as length grows.
Can you give some suggestions about the evaluation process? Thanks.