alwaysuu / CLIPDenoising

CVPR2024: Transfer CLIP for Generalizable Image Denoising

Hi, the analyses of the CLIP image encoder are quite straightforward. First, synthesize noisy images from clean ones (using Gaussian or Poisson noise). Then, feed these noisy and clean images directly to the CLIP ResNet encoder, without the crop, resize, or normalization done in the original CLIP preprocessing. Finally, obtain the dense features of the noisy and clean images from the CLIP ResNet encoder, respectively, and compute their similarities (using, e.g., cosine distance or CKA similarity). #5

Closed xiaozhuqian closed 1 month ago

xiaozhuqian commented 1 month ago
          Hi, the analyses of the CLIP image encoder are quite straightforward. First, synthesize noisy images from clean ones (using Gaussian or Poisson noise). Then, feed these noisy and clean images directly to the CLIP ResNet encoder, without the crop, resize, or normalization done in the original CLIP preprocessing. Finally, obtain the dense features of the noisy and clean images from the CLIP ResNet encoder, respectively, and compute their similarities (using, e.g., cosine distance or CKA similarity).

Originally posted by @alwaysuu in https://github.com/alwaysuu/CLIPDenoising/issues/2#issuecomment-2164751378

xiaozhuqian commented 1 month ago

Thanks for your great work.

Could you provide the code you used to compare the features of a noisy image with those of its clean counterpart? I have two questions. First, how should the images be preprocessed? I'm new to CLIP, and the preprocessing I found (https://github.com/openai/CLIP/blob/main/clip/clip.py#L79) differs from what you describe, so I'm not sure whether I'm looking at the right code. Second, there are many ways to compute the distance between features, even just for cosine distance: should I compute the cosine distance per feature channel and average the results, or use torch.nn.CosineSimilarity along dim 1 and average?
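
For concreteness, here is a small sketch of the two options I have in mind, using random tensors in place of real CLIP dense features (the shapes are made up for illustration):

import torch

# stand-in dense features of shape [B, C, H, W]; real ones would come from the CLIP ResNet encoder
feat_noisy = torch.randn(1, 256, 56, 56)
feat_clean = feat_noisy + 0.05 * torch.randn_like(feat_noisy)

# option 1: cosine similarity per channel (over the H*W spatial positions), then averaged
per_channel = torch.cosine_similarity(feat_noisy.flatten(2), feat_clean.flatten(2), dim=2)  # [B, C]
sim_per_channel = per_channel.mean().item()

# option 2: a single cosine similarity over the fully flattened feature map
sim_flattened = torch.cosine_similarity(feat_noisy.flatten(1), feat_clean.flatten(1), dim=1).item()

print(sim_per_channel, sim_flattened)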

alwaysuu commented 1 month ago

Sorry for the late reply. We used the following code snippet:

from PIL import Image
import torch
import clip
from torchvision.transforms import Compose,ToTensor

def _convert_image_to_rgb(image):
    return image.convert("RGB")

def _transform():
    # only ToTensor: no resize, center-crop, or CLIP normalization,
    # matching the "no crop, resize or normalization" setup described above
    return Compose([
        _convert_image_to_rgb,
        ToTensor(),
    ])

def feature_similarity_check(image_input, image_gt, model):
    similarities = []
    with torch.no_grad():
        # encode_image is expected to return dense feature maps
        # (iterated below as [B, C, H, W] tensors), not the pooled CLIP embedding
        image_features = model.encode_image(image_input)
        image_features_gt = model.encode_image(image_gt)

    for image_feature, image_feature_gt in zip(image_features, image_features_gt):
        print(image_feature.shape)
        B, C, H, W = image_feature.shape
        # flatten each dense feature map and compute one cosine similarity per map
        image_feature = image_feature.reshape(B, -1)
        image_feature_gt = image_feature_gt.reshape(B, -1)
        similarities.append(torch.cosine_similarity(image_feature.double(), image_feature_gt.double(), dim=1).item())
    return similarities

model, _ = clip.load("RN50.pt")
model.cuda().eval()

preprocess = _transform()
image = Image.open('your.png').convert("RGB")
img_torch = preprocess(image).unsqueeze(0).cuda()
img_torch_noisy = img_torch + torch.randn_like(img_torch) * 0.1  # add Gaussian noise with std=0.1 (images in [0, 1] range)
sim = feature_similarity_check(img_torch_noisy, img_torch, model)

print(sim)
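
For the CKA similarity mentioned in the original analysis, the code above only covers the cosine case. A minimal linear-CKA sketch on the same dense features (an illustrative example, not necessarily the exact formulation used in the paper) could look like this:

def linear_cka(feat_a, feat_b):
    # feat_a, feat_b: dense features of shape [1, C, H, W]
    # treat each spatial position as a sample: X, Y have shape [H*W, C]
    X = feat_a.squeeze(0).flatten(1).T.double()
    Y = feat_b.squeeze(0).flatten(1).T.double()
    # center each feature dimension
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    # linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    hsic = torch.norm(Y.T @ X, p='fro') ** 2
    norm_x = torch.norm(X.T @ X, p='fro')
    norm_y = torch.norm(Y.T @ Y, p='fro')
    return (hsic / (norm_x * norm_y)).item()

It can be called on each pair of dense feature maps, i.e., on image_feature and image_feature_gt before they are flattened in the loop above.
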
xiaozhuqian commented 1 month ago

Thanks very much for your reply.