mposey opened 3 months ago
I encountered the same problem. Most of the relevant descriptive texts produced low cosine similarity scores, fluctuating above and below 0 and not much higher than the scores for random texts. I wonder if I miscalculated. Have you solved this problem? The rough logic of my code is as follows:
with torch.no_grad():
    # load and preprocess the video frames
    images = read_frames(...)
    images = image_transform(images)
    images = images.to(device)
    # encode the video and L2-normalize the feature
    image_feat = clip_model.encode_vision(images.unsqueeze(0), test=True).float()
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    # encode the text and L2-normalize the feature
    text_feat = clip_model.encode_text(tokenizer([text], max_length=300).to(device)).float()
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    # cosine similarity of the normalized features
    logit_per_text = image_feat @ text_feat.T
    score_per_video = float(logit_per_text[0][0].cpu())
I have resolved it. The model was not loading the pretrained weights, because pretrained_path in the config file was not filled in. I got much better results after loading 1B_clip.pth from https://huggingface.co/OpenGVLab/InternVideo2-CLIP-1B-224p-f8/tree/main.
@UknowSth Could you please provide a bit more detail on your solution? I encountered the same issue.
Take scripts/evaluation/clip/zero_shot/1B/config_anet.py as an example, where pretrained_path is not filled in by default. Download 1B_clip.pth from Hugging Face and then fill in pretrained_path.
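For reference, the change amounts to pointing the config at the downloaded checkpoint. A minimal sketch, assuming the Python-style config used by the evaluation scripts (the exact field layout of config_anet.py may differ, and the path below is a hypothetical local location):

```python
# scripts/evaluation/clip/zero_shot/1B/config_anet.py (sketch; surrounding fields omitted)
pretrained_path = "/path/to/1B_clip.pth"  # hypothetical path to the checkpoint from Hugging Face
```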
After successfully loading the weights of chinese_alpaca_lora_7b, 1B_clip.pth, InternVideo2-stage2_1b-224p-f4.pt, and internvl_c_13b_224px.pth, the cosine similarity scores look normal.
# load model
model = InternVideo2_CLIP(cfg, is_pretrain=False).to(device)
tokenizer = model.tokenizer

# load pretrained weights; the checkpoint may nest the state dict under 'model' or 'module'
checkpoint = torch.load(cfg.pretrained_path, map_location="cpu")
if "model" in checkpoint.keys():
    state_dict = checkpoint["model"]
elif "module" in checkpoint.keys():
    state_dict = checkpoint["module"]
else:
    state_dict = checkpoint
msg = model.load_state_dict(state_dict, strict=False)
logger.info(msg)  # inspect missing/unexpected keys to confirm the weights actually loaded
logger.info(f"Loaded checkpoint from {cfg.pretrained_path}")
model.eval()
...

# inference
image_feat = model.encode_vision(images.unsqueeze(0), test=True).float()
image_feat /= image_feat.norm(dim=-1, keepdim=True)
text_feat = model.encode_text(tokenizer([text], max_length=180).to(device)).float()
text_feat /= text_feat.norm(dim=-1, keepdim=True)
logit_per_text = image_feat @ text_feat.T
score_per_video = float(logit_per_text[0][0].cpu())
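As a follow-up sanity check (my own addition, not from the original scripts): with correctly loaded weights, a relevant caption should score clearly above an unrelated one. The captions below are hypothetical:

```python
# compare the video feature against a relevant and an unrelated caption
captions = ["a person is cooking in a kitchen",  # hypothetical relevant caption
            "a diagram of a database schema"]    # hypothetical unrelated caption
with torch.no_grad():
    feats = []
    for c in captions:
        t = model.encode_text(tokenizer([c], max_length=180).to(device)).float()
        feats.append(t / t.norm(dim=-1, keepdim=True))
    text_feats = torch.cat(feats, dim=0)     # shape (2, D)
    scores = (image_feat @ text_feats.T)[0]  # cosine similarities
print(scores)  # expect scores[0] noticeably higher than scores[1]
```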
@UknowSth @mposey Which GPU model did you use to run this inference? It seems that an L4 with 23 GB of GPU memory still hits an out-of-memory crash when loading the models.
I have been observing low cosine similarity scores for InternVideo2 video embeddings compared to relevant text caption embeddings. In some cases, the scores are even negative. I am not sure if I am misinterpreting how to compute the similarity scores for videos and their text captions.
Expected outcome: The cosine similarity scores should be high (close to 1) for videos with relevant captions.
Actual outcome: The cosine similarity scores are low (sometimes negative) for some videos with relevant captions.
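For completeness, a minimal, self-contained sketch of the similarity computation I am describing, using random tensors in place of the real encoder outputs; normalizing both features and taking a dot product is equivalent to torch.nn.functional.cosine_similarity:

```python
import torch
import torch.nn.functional as F

# stand-ins for the real encoder outputs (shape (1, D))
video_feat = torch.randn(1, 768)
text_feat = torch.randn(1, 768)

# cosine similarity == dot product of L2-normalized vectors
score = F.cosine_similarity(video_feat, text_feat, dim=-1)
manual = (F.normalize(video_feat, dim=-1) @ F.normalize(text_feat, dim=-1).T)[0, 0]
assert torch.allclose(score[0], manual)
print(float(score))  # relevant pairs should score higher than random pairs
```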