Open lennartmoritz opened 3 weeks ago
Hi @lennartmoritz, I'm currently using this model for my project and I'm having the same issue with eval_msrvtt.sh.
I wrote my own script for model evaluation. Unfortunately, the FT models do not show the expected results, but the Large models are OK (LanguageBind_Video, LanguageBind_Audio).
You can try running my script; it gave me around 41.50 R@1, 65.80 R@5, 75.50 R@10.
from collections import defaultdict
import torch
import pandas as pd
import numpy as np
from more_itertools import chunked
from tqdm.auto import tqdm
from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer
def compute_metrics(x):
    # x is a square similarity matrix; the true match for row i is x[i, i].
    x = np.asarray(x)
    sx = np.sort(-x, axis=1)
    d = np.diag(-x)
    d = d[:, np.newaxis]
    ind = sx - d
    # Column index of the true match in each sorted row = its 0-based rank.
    ind = np.where(ind == 0)
    ind = ind[1]
    metrics = {}
    metrics['R1'] = float(np.sum(ind == 0)) * 100 / len(ind)
    metrics['R5'] = float(np.sum(ind < 5)) * 100 / len(ind)
    metrics['R10'] = float(np.sum(ind < 10)) * 100 / len(ind)
    metrics['MR'] = np.median(ind) + 1
    metrics["MedianR"] = metrics['MR']
    metrics["MeanR"] = np.mean(ind) + 1
    # metrics["cols"] = [int(i) for i in list(ind)]
    return metrics


def main():
    device = torch.device('cuda:0')
    clip_type = {
        'video': 'LanguageBind_Video',  # append _FT for the fine-tuned variant
        'audio': 'LanguageBind_Audio',  # append _FT for the fine-tuned variant
        # 'image': 'LanguageBind_Image',
        # 'thermal': 'LanguageBind_Thermal',
        # 'depth': 'LanguageBind_Depth',
    }
    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir').to(device)
    model.eval()

    tokenizer = LanguageBindImageTokenizer.from_pretrained('lb203/LanguageBind_Image', cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    # One caption and one video path per row of the test CSV.
    df = pd.read_csv('../data/MSRVTT/MSRVTT_JSFUSION_test.csv')
    language_data = df['sentence'].values.tolist()
    video_data = df['video_id'].apply(lambda x: str(f'../data/MSRVTT/videos/all/{x}.mp4')).values.tolist()

    def embed(x: list[list], dtypes: list[str]) -> dict:
        # Build one model input per modality, then run a single forward pass.
        inputs = {}
        for data, dtype in zip(x, dtypes):
            if dtype == 'language':
                inputs['language'] = to_device(tokenizer(data, max_length=77, padding='max_length', truncation=True, return_tensors='pt'), device)
            elif dtype in ['image', 'video', 'audio', 'depth', 'thermal']:
                inputs[dtype] = to_device(modality_transform[dtype](data), device)
            else:
                raise ValueError(f'Unsupported dtype: {dtype}')
        with torch.no_grad():
            embeddings = model(inputs)
        embeddings = {k: v.detach().cpu().numpy() for k, v in embeddings.items()}
        return embeddings

    batch_size = 16
    # Start from empty (0, 768) arrays so every batch can be concatenated on.
    results = defaultdict(lambda: np.empty((0, 768)))
    for batch in tqdm(list(zip(
        chunked(language_data, batch_size),
        chunked(video_data, batch_size)
    ))):
        embeddings = embed(
            batch,
            dtypes=['language', 'video']
        )
        results['language'] = np.concatenate([results['language'], embeddings['language']])
        results['video'] = np.concatenate([results['video'], embeddings['video']])

    video = results['video']
    language = results['language']
    np.save('experiments/MSR-VTT_test_video_embeddings.npy', video)
    np.save('experiments/MSR-VTT_test_language_embeddings.npy', language)

    # Rows are videos, columns are captions; the transpose swaps the direction.
    sim_matrix = torch.tensor(video @ language.T)
    print('VT', compute_metrics(sim_matrix))
    print('TV', compute_metrics(sim_matrix.T))


if __name__ == '__main__':
    main()
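To sanity-check the metric logic without loading any model, here is a minimal sketch on a toy 3x3 similarity matrix. The helper names `retrieval_ranks` and `recall_at_k` are my own and are not part of the repo. Instead of the sort-and-subtract trick in `compute_metrics`, it counts how many candidates score strictly higher than the true match, which gives the same rank when there are no ties and avoids matching several columns per row when scores do tie.

```python
import numpy as np


def retrieval_ranks(sim):
    """Return the 0-based rank of the diagonal (true) match for every row."""
    sim = np.asarray(sim, dtype=np.float64)
    diag = np.diag(sim)[:, np.newaxis]
    # Rank = number of candidates that score strictly higher than the true match.
    return (sim > diag).sum(axis=1)


def recall_at_k(sim, k):
    """Percentage of rows whose true match is among the top-k candidates."""
    ranks = retrieval_ranks(sim)
    return float(np.mean(ranks < k)) * 100


# Hypothetical toy data: row i should retrieve column i.
toy = np.array([
    [0.9, 0.1, 0.2],  # true match is ranked first
    [0.8, 0.5, 0.1],  # true match is ranked second
    [0.3, 0.7, 0.2],  # true match is ranked third
])

ranks = retrieval_ranks(toy)
print('ranks:', ranks.tolist())
print('R@1:', recall_at_k(toy, 1))
print('R@3:', recall_at_k(toy, 3))
print('MedianR:', float(np.median(ranks)) + 1)
```

If the ranks printed for a matrix like this are not what you expect, the problem is in the metric code rather than in the embeddings.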
Hey @e1four15f, thank you for your code example. In the meantime, I wrote a similar script based on the inference example script from the repo, but I've noticed that it is considerably slower than the eval script. I suspect this is due to the batch sizes used. Have you found a way to select a batch size for inference with your script?
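For reference, the script above fixes the batch size with the `batch_size = 16` variable in `main()`. The pattern itself can be sketched independently of the model, as below. `fake_embed` is a hypothetical stand-in for the real forward pass, and `batched` and `embed_all` are illustrative names, not repo functions. Collecting the per-batch outputs in a list and concatenating once also avoids re-copying the growing array on every iteration. Whether a larger batch actually speeds things up is an assumption to test, since video decoding and GPU memory may be the real limits.

```python
import numpy as np


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(items, embed_fn, batch_size):
    """Embed `items` batch by batch and concatenate the results once at the end."""
    chunks = [embed_fn(batch) for batch in batched(items, batch_size)]
    return np.concatenate(chunks, axis=0)


def fake_embed(batch):
    # Hypothetical stand-in for the model: one 2-d vector per input string.
    return np.array([[len(s), 1.0] for s in batch], dtype=np.float32)


texts = ['a', 'bb', 'ccc', 'dddd', 'eeeee']
small = embed_all(texts, fake_embed, batch_size=2)
large = embed_all(texts, fake_embed, batch_size=4)

print(small.shape)
# The batch size should change speed and memory use, not the embeddings.
print(np.array_equal(small, large))
```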
I am trying to verify/reproduce your paper's validation results without training the model myself, and I expected 42.6% R@1 accuracy for MSR-VTT.
But when I follow the instructions from TRAIN_AND_VALIDATE.md (I only ran eval.sh, no training), I get results that are as bad as random guessing, with about 0.1% R@1 accuracy. See my out.log here:
What I need:
Please tell me how I can select your final model for the eval script so that it leads to the same results that you published.
What I suspect is wrong:
Well, I guess the issue is that I am trying to evaluate the untrained model here instead of your trained version. Maybe I misunderstood the instructions, and the pretrained weights I downloaded are not the same as your fully trained model described in the paper.
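A quick arithmetic check supports the suspicion that the weights are effectively random. With N candidates and one correct match per query, uniform random ranking gives an expected R@K of K/N. The sketch below assumes the test set contains 1000 video-caption pairs. That figure is my assumption about MSRVTT_JSFUSION_test.csv, and `chance_recall` is an illustrative helper, not a repo function.

```python
def chance_recall(num_candidates, k):
    """Expected R@K in percent when candidates are ranked uniformly at random."""
    return 100.0 * min(k, num_candidates) / num_candidates


# Assumption: 1000 video-caption pairs in the test split.
NUM_PAIRS = 1000

for k in (1, 5, 10):
    print(f'chance R@{k}: {chance_recall(NUM_PAIRS, k)}%')
```

Under that assumption, chance-level R@1 is 0.1%, which matches the number in the log and would be consistent with an encoder whose weights were not loaded correctly.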
I have also tried to get your final model by running my eval_msrvtt.sh script with the TRANSFORMERS_OFFLINE=0 environment variable and an empty cache_dir, in the hope of downloading the fully trained version. Strangely enough, this leads to slightly different results in my out.log:
How to reproduce:
I follow TRAIN_AND_VALIDATE.md. I take eval.sh and save it as eval_msrvtt.sh. Then I execute the script.
This is my eval_msrvtt.sh: