PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Non-reproducible MSRVTT results - I get R@1 accuracy less than 1% #51

Open lennartmoritz opened 3 weeks ago

lennartmoritz commented 3 weeks ago

I am trying to verify/reproduce your paper's validation results without training the model myself, and I expected 42.6% R@1 accuracy for MSR-VTT.

But when I follow the instructions from TRAIN_AND_VALIDATE.md (I only ran eval.sh, no training), I get results that are as bad as random guessing, with about 0.1% R@1 accuracy. See my out.log here:

Eval Epoch: 0, eval Video-Text Retrieval under MSRVTT test data
2024-04-21,14:07:56 | INFO | MSRVTT sim matrix size: 1000, 1000
2024-04-21,15:02:43 | INFO | Length-T: 1000, Length-V:1000
2024-04-21,15:02:47 | INFO | MSRVTT Text-to-Video:
2024-04-21,15:02:53 | INFO | >>> R@1: 0.0 - R@5: 0.6 - R@10: 0.8 - Median R: 516.0 - Mean R: 518.7
2024-04-21,15:03:00 | INFO | MSRVTT Video-to-Text:
2024-04-21,15:03:03 | INFO | >>> V2T$R@1: 0.1 - V2T$R@5: 0.6 - V2T$R@10: 0.8 - V2T$Median R: 491.0 - V2T$Mean R: 498.2

What I need:

Please tell me how I can select your final model for the eval script so that it reproduces the results you published.

What I suspect is wrong:

Well, I guess the issue is that I am trying to evaluate the untrained model here instead of your trained version. Maybe I misunderstood the instructions, and the pretrained weights I downloaded are not the same as your fully trained model described in the paper.

I have also tried to get your final model by running my eval_msrvtt.sh script with the TRANSFORMERS_OFFLINE=0 environment variable and an empty cache_dir, in hopes of downloading the fully trained version. Strangely enough, this leads to slightly different results in my out.log:

2024-04-19,13:59:28 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/tokenizer_config.json to /raid/1moritz/models/languagebind/cache_dir/tmpctkzbg3u
2024-04-19,13:59:29 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/vocab.json to /raid/1moritz/models/languagebind/cache_dir/tmp6_ww7ayw
2024-04-19,13:59:29 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/merges.txt to /raid/1moritz/models/languagebind/cache_dir/tmp3g7ehptb
2024-04-19,13:59:30 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/tokenizer.json to /raid/1moritz/models/languagebind/cache_dir/tmp4h042saq
2024-04-19,13:59:31 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/special_tokens_map.json to /raid/1moritz/models/languagebind/cache_dir/tmp0exqanes
2024-04-19,13:59:31 | INFO | {'vl_ret': [{'msrvtt': <torch.utils.data.dataloader.DataLoader object at 0x7f9015f066b0>}]})
2024-04-19,13:59:31 | INFO | Eval Epoch: 0, eval Video-Text Retrieval under MSRVTT test data
2024-04-19,14:06:35 | INFO | MSRVTT sim matrix size: 1000, 1000
2024-04-19,14:06:35 | INFO | Length-T: 1000, Length-V:1000
2024-04-19,14:06:35 | INFO | MSRVTT Text-to-Video:
2024-04-19,14:06:35 | INFO | >>> R@1: 0.0 - R@5: 0.4 - R@10: 0.7 - Median R: 511.0 - Mean R: 505.5
2024-04-19,14:06:35 | INFO | MSRVTT Video-to-Text:
2024-04-19,14:06:35 | INFO | >>> V2T$R@1: 0.2 - V2T$R@5: 0.6 - V2T$R@10: 0.9 - V2T$Median R: 500.0 - V2T$Mean R: 504.9
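
For what it's worth, a minimal way to sanity-check a downloaded checkpoint is to load it directly and inspect which keys and tensors it stores. This is only a sketch: the file name video_language.pt matches the RESUME variable in my script below, and the 'state_dict' key is an assumption about the checkpoint layout, not something confirmed by the repo.

import torch

# Hypothetical sanity check: load the checkpoint on CPU and inspect what it stores.
# The file name and the 'state_dict' key are assumptions, not confirmed by the repo.
ckpt = torch.load("video_language.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
    # If the weights are nested under a 'state_dict' key, peek at a few parameter
    # names and shapes to check that the tensors are actually there.
    state = ckpt.get("state_dict", ckpt)
    if isinstance(state, dict):
        for name, value in list(state.items())[:5]:
            if torch.is_tensor(value):
                print(name, tuple(value.shape))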

How to reproduce:

I followed TRAIN_AND_VALIDATE.md:

  1. Download cache of pretrained weights from your google drive and specify CACHE_DIR.
  2. Download MSRVTT from the source you mentioned in TRAIN_AND_VALIDATE.md.
  3. Change the data_root here.
  4. Make minimal changes to eval.sh and save it as eval_msrvtt.sh. Then execute the script.

This is my eval_msrvtt.sh:

CACHE_DIR="/raid/1moritz/models/languagebind/cache_dir"
RESUME="video_language.pt"
ANNOTATION="path/to/data"
# this script is for 640 total batch_size (n(16) GPUs * batch_size(10) * accum_freq(4))
cd /srv/home/1moritz/Repositories/LanguageBind
# TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_addr $CHIEF_IP \
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
    -m main  \
    --train-data ${ANNOTATION} \
    --train-num-samples 3020000 \
    --clip-type "vl" --add-time-attn \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 16 \
    --lr 1e-4 --coef-lr 1 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 8 --force-patch-dropout 0.3 \
    --epochs 16 --batch-size 10 --accum-freq 4 --warmup 2000 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_vl_ret_data "msrvtt"
e1four15f commented 2 weeks ago

Hi @lennartmoritz, I'm currently using this model for my project and I'm having the same issue with eval_msrvtt.sh.

I wrote my own script for model evaluation. Unfortunately, the FT models do not show the expected results, but the Large models are fine (LanguageBind_Video, LanguageBind_Audio).

You can try running my script; it gave me around 41.50 R@1, 65.80 R@5, 75.50 R@10.

from collections import defaultdict

import torch
import pandas as pd
import numpy as np
from more_itertools import chunked
from tqdm.auto import tqdm

from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

def compute_metrics(x):
    sx = np.sort(-x, axis=1)
    d = np.diag(-x)
    d = d[:, np.newaxis]
    ind = sx - d
    ind = np.where(ind == 0)
    ind = ind[1]
    metrics = {}
    metrics['R1'] = float(np.sum(ind == 0)) * 100 / len(ind)
    metrics['R5'] = float(np.sum(ind < 5)) * 100 / len(ind)
    metrics['R10'] = float(np.sum(ind < 10)) * 100 / len(ind)
    metrics['MR'] = np.median(ind) + 1
    metrics["MedianR"] = metrics['MR']
    metrics["MeanR"] = np.mean(ind) + 1
    # metrics["cols"] = [int(i) for i in list(ind)]
    return metrics
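
# Worked example with hypothetical values: if every diagonal entry of a 3x3 similarity
# matrix is the largest in its row, each query ranks its matching pair first, so
# compute_metrics returns R1 = R5 = R10 = 100.0 and MedianR = MeanR = 1.0:
#   compute_metrics(np.array([[0.9, 0.1, 0.2],
#                             [0.3, 0.8, 0.1],
#                             [0.2, 0.4, 0.7]]))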

def main():
    device = torch.device('cuda:0')
    clip_type = {
        'video': 'LanguageBind_Video',  # or 'LanguageBind_Video_FT'
        'audio': 'LanguageBind_Audio',  # or 'LanguageBind_Audio_FT'
        # 'image': 'LanguageBind_Image',
        # 'thermal': 'LanguageBind_Thermal',
        # 'depth': 'LanguageBind_Depth',
    }

    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir').to(device)
    model.eval()

    tokenizer = LanguageBindImageTokenizer.from_pretrained('lb203/LanguageBind_Image', cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    df = pd.read_csv('../data/MSRVTT/MSRVTT_JSFUSION_test.csv')

    language_data = df['sentence'].values.tolist()
    video_data = df['video_id'].apply(lambda x: str(f'../data/MSRVTT/videos/all/{x}.mp4')).values.tolist()

    def embed(x: list[list], dtypes: list[str]) -> dict:
        inputs = {}
        for data, dtype in zip(x, dtypes):
            if dtype == 'language':
                inputs['language'] = to_device(tokenizer(data, max_length=77, padding='max_length', truncation=True, return_tensors='pt'), device)
            elif dtype in ['image', 'video', 'audio', 'depth', 'thermal']:
                inputs[dtype] = to_device(modality_transform[dtype](data), device)
            else:
                raise ValueError(f'Unsupported dtype: {dtype}')

        with torch.no_grad():
            embeddings = model(inputs)

        embeddings = {k: v.detach().cpu().numpy() for k, v in embeddings.items()}
        return embeddings

    batch_size = 16
    results = defaultdict(lambda: np.empty((0, 768)))  # empty (0, 768) arrays to concatenate batch embeddings onto
    for batch in tqdm(list(zip(
            chunked(language_data, batch_size),
            chunked(video_data, batch_size)
        ))):
        embeddings = embed(
            batch,
            dtypes=['language', 'video']
        )
        results['language'] = np.concatenate([results['language'], embeddings['language']])
        results['video'] = np.concatenate([results['video'], embeddings['video']])

    video = results['video']
    language = results['language']

    np.save('experiments/MSR-VTT_test_video_embeddings.npy', video)
    np.save('experiments/MSR-VTT_test_language_embeddings.npy', language)

    sim_matrix = torch.tensor(video @ language.T)
    print('VT', compute_metrics(sim_matrix))
    print('TV', compute_metrics(sim_matrix.T))

if __name__ == '__main__':
    main()
lennartmoritz commented 2 weeks ago

Hey @e1four15f, thank you for your code example. In the meantime, I wrote a similar script to yours based on the inference example script from the repo. But I've noticed that it is considerably slower than when I used the eval script. I suspect it has to do with the batch sizes used. Have you found a way to select a batch size for inference with your script?
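
For reference, this is a minimal sketch of the kind of batching I have in mind, assuming the same languagebind helpers as in your script (model, tokenizer, modality_transform, to_device already set up) and placeholder captions / video_paths lists. It is just a sketch, not the repo's eval pipeline.

import numpy as np
import torch

from languagebind import to_device  # same helper as in the script above

def encode_in_batches(captions, video_paths, model, tokenizer, modality_transform,
                      device, batch_size=32):
    """Encode caption/video pairs in fixed-size batches and return two embedding matrices."""
    text_chunks, video_chunks = [], []
    for start in range(0, len(captions), batch_size):
        text_batch = captions[start:start + batch_size]
        video_batch = video_paths[start:start + batch_size]
        inputs = {
            'language': to_device(tokenizer(text_batch, max_length=77, padding='max_length',
                                            truncation=True, return_tensors='pt'), device),
            'video': to_device(modality_transform['video'](video_batch), device),
        }
        with torch.no_grad():
            out = model(inputs)
        text_chunks.append(out['language'].cpu().numpy())
        video_chunks.append(out['video'].cpu().numpy())
    return np.concatenate(text_chunks), np.concatenate(video_chunks)

Increasing batch_size mainly trades GPU memory for throughput; in scripts like this, the video decoding inside the transform is often the dominant cost, which may explain part of the gap to the eval script.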