mposey opened 3 months ago
I encountered the same problem. Most of the relevant descriptive texts produced low cosine similarity scores, fluctuating above and below 0 and not much higher than the scores for random texts. I wonder if I miscalculated. Have you solved this problem? The rough logic of my code is as follows:
with torch.no_grad():
    # load and preprocess the video frames
    images = read_frames(...)
    images = image_transform(images)
    images = images.to(device)
    # encode the video and L2-normalize the feature
    image_feat = clip_model.encode_vision(images.unsqueeze(0), test=True).float()
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    # encode the text and L2-normalize the feature
    text_feat = clip_model.encode_text(tokenizer([text], max_length=300).to(device)).float()
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    # cosine similarity of the normalized features
    logit_per_text = image_feat @ text_feat.T
    score_per_video = float(logit_per_text[0][0].cpu())
I have resolved it. The model was not loading the pretrained weights, because pretrained_path in the config file was not filled in. I got much better results after loading 1B_clip.pth from https://huggingface.co/OpenGVLab/InternVideo2-CLIP-1B-224p-f8/tree/main.
@UknowSth Could you please provide a bit more detail on your solution? I encountered the same issue.
Take scripts/evaluation/clip/zero_shot/1B/config_anet.py as an example, where pretrained_path is not filled in by default. Download 1B_clip.pth from Hugging Face and then fill in pretrained_path.
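For reference, the change amounts to pointing the config at the downloaded checkpoint. A minimal sketch, assuming the Python-style config used by the evaluation scripts (the exact field layout of config_anet.py may differ, and the path below is a hypothetical local location):

```python
# scripts/evaluation/clip/zero_shot/1B/config_anet.py (sketch; surrounding fields omitted)
pretrained_path = "/path/to/1B_clip.pth"  # hypothetical path to the checkpoint from Hugging Face
```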
After successfully loading the weights of chinese_alpaca_lora_7b, 1B_clip.pth, InternVideo2-stage2_1b-224p-f4.pt, and internvl_c_13b_224px.pth, the cosine similarity scores look normal.
# load model
model = InternVideo2_CLIP(cfg, is_pretrain=False).to(device)
tokenizer = model.tokenizer

# load pretrained weights; the checkpoint may nest the state dict under 'model' or 'module'
checkpoint = torch.load(cfg.pretrained_path, map_location="cpu")
if "model" in checkpoint.keys():
    state_dict = checkpoint["model"]
elif "module" in checkpoint.keys():
    state_dict = checkpoint["module"]
else:
    state_dict = checkpoint
msg = model.load_state_dict(state_dict, strict=False)
logger.info(msg)  # inspect missing/unexpected keys to confirm the weights actually loaded
logger.info(f"Loaded checkpoint from {cfg.pretrained_path}")
model.eval()
...

# inference
image_feat = model.encode_vision(images.unsqueeze(0), test=True).float()
image_feat /= image_feat.norm(dim=-1, keepdim=True)
text_feat = model.encode_text(tokenizer([text], max_length=180).to(device)).float()
text_feat /= text_feat.norm(dim=-1, keepdim=True)
logit_per_text = image_feat @ text_feat.T
score_per_video = float(logit_per_text[0][0].cpu())
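As a follow-up sanity check (my own addition, not from the original scripts): with correctly loaded weights, a relevant caption should score clearly above an unrelated one. The captions below are hypothetical:

```python
# compare the video feature against a relevant and an unrelated caption
captions = ["a person is cooking in a kitchen",  # hypothetical relevant caption
            "a diagram of a database schema"]    # hypothetical unrelated caption
with torch.no_grad():
    feats = []
    for c in captions:
        t = model.encode_text(tokenizer([c], max_length=180).to(device)).float()
        feats.append(t / t.norm(dim=-1, keepdim=True))
    text_feats = torch.cat(feats, dim=0)     # shape (2, D)
    scores = (image_feat @ text_feats.T)[0]  # cosine similarities
print(scores)  # expect scores[0] noticeably higher than scores[1]
```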
@UknowSth @mposey Which GPU model did you use to run this inference? It seems that an L4 with 23 GB of GPU memory still hits an out-of-memory crash when loading the models.
I have been observing low cosine similarity scores for InternVideo2 video embeddings compared to relevant text caption embeddings. In some cases, the scores are even negative. I am not sure if I am misinterpreting how to compute the similarity scores for videos and their text captions.
Expected outcome: The cosine similarity scores should be high (close to 1) for videos with relevant captions.
Actual outcome: The cosine similarity scores are low (sometimes negative) for some videos with relevant captions.
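For completeness, a minimal, self-contained sketch of the similarity computation I am describing, using random tensors in place of the real encoder outputs; normalizing both features and taking a dot product is equivalent to torch.nn.functional.cosine_similarity:

```python
import torch
import torch.nn.functional as F

# stand-ins for the real encoder outputs (shape (1, D))
video_feat = torch.randn(1, 768)
text_feat = torch.randn(1, 768)

# cosine similarity == dot product of L2-normalized vectors
score = F.cosine_similarity(video_feat, text_feat, dim=-1)
manual = (F.normalize(video_feat, dim=-1) @ F.normalize(text_feat, dim=-1).T)[0, 0]
assert torch.allclose(score[0], manual)
print(float(score))  # relevant pairs should score higher than random pairs
```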