OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Can't Reproduce Zero-Shot Performance on MSRVTT and LSMDC with InternVid-10M-FLT Checkpoint #139

Open fmthoker opened 3 months ago

fmthoker commented 3 months ago

Dear Authors, I am trying to reproduce the zero-shot performance of the ViCLIP-L-14 InternVid-10M-FLT checkpoint. However, my results differ from the numbers reported in the paper. Here are the results I obtain:

MSRVTT:

split                 txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
msrvtt_1k_test/       38.9    62.2    74.0     58.37       39.4    61.9    73.0     58.10       58.23
msrvtt_1k_test_emb/   39.0    62.2    73.3     58.17       39.1    63.2    73.9     58.73       58.45

LSMDC:

split       txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
test/       15.2    29.0    35.6     26.6        17.8    32.1    40.1     30.00       28.30
test_emb/   15.8    29.1    36.7     27.2        18.5    32.7    40.8     30.67       28.93
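
For readers comparing these rows with the paper: the mean columns are consistent with plain averages, where txt_r_mean and img_r_mean average R@1, R@5 and R@10 in each direction and r_mean averages those two. The sketch below checks this against the msrvtt_1k_test/ row. The function name `recall_means` is hypothetical and is not taken from the repository.

```python
def recall_means(txt_recalls, img_recalls):
    """Average the recall values per direction, then average the two means."""
    txt_mean = sum(txt_recalls) / len(txt_recalls)
    img_mean = sum(img_recalls) / len(img_recalls)
    r_mean = (txt_mean + img_mean) / 2
    return txt_mean, img_mean, r_mean


# R@1, R@5, R@10 from the msrvtt_1k_test/ row above.
txt_mean, img_mean, r_mean = recall_means([38.9, 62.2, 74.0], [39.4, 61.9, 73.0])
print(round(txt_mean, 2), round(img_mean, 2), round(r_mean, 2))
```

Rounded to two decimals, this matches the 58.37, 58.10 and 58.23 reported in that row.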

Here is the script that I run to obtain these results:

source /ibex/user/thokerfm/anaconda3/bin/activate viclip
export PYTHONPATH=.

MASTER_NODE=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=$((RANDOM % (65535 - 1024 + 1) + 1024))

echo $MASTER_NODE
echo $MASTER_PORT

OUTPUT_DIR='expirements_zero_shot/ViClip-InternVid-10M-FLT/lsmdc/'

OMP_NUM_THREADS=1 torchrun --rdzv_endpoint=${MASTER_NODE}:${MASTER_PORT} \
    --nnodes=1 \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    tasks/retrieval.py \
    $(dirname $0)/config.py \
    wandb.enable False \
    train_corpus viclip \
    evaluate True \
    output_dir ${OUTPUT_DIR} \
    model.vision_encoder.pretrained 'CLIP-ViT-L/14' \
    model.text_encoder.pretrained 'CLIP-ViT-L/14' \
    pretrained_path pretrained_viclip_models/ViClip-InternVid-10M-FLT.pth

leexinhao commented 3 months ago

I guess you didn't turn on wise-ft. During testing, we average the InternVid-10M-filtered weights with the original CLIP weights.
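
As a rough illustration of the weight averaging described above, here is a minimal sketch that interpolates two checkpoints parameter by parameter. It uses plain Python lists in place of tensors so it runs without any dependencies. The function name `interpolate_weights`, the parameter name `proj`, and the coefficient `alpha=0.5` (an equal-weight average) are assumptions for illustration, not the repository's actual code or config keys.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha=0.5):
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned for every parameter.

    Both arguments map parameter names to flat lists of floats, standing in
    for the tensors of a real checkpoint (an assumption of this sketch).
    """
    if zero_shot.keys() != fine_tuned.keys():
        raise KeyError("checkpoints must share the same parameter names")
    merged = {}
    for name, base in zero_shot.items():
        tuned = fine_tuned[name]
        merged[name] = [(1 - alpha) * b + alpha * t for b, t in zip(base, tuned)]
    return merged


# Toy stand-ins for the original CLIP weights and the video fine-tuned weights.
clip_weights = {"proj": [0.0, 2.0]}
viclip_weights = {"proj": [1.0, 4.0]}
merged = interpolate_weights(clip_weights, viclip_weights)
print(merged)
```

With `alpha=1.0` the result is the fine-tuned checkpoint unchanged, which corresponds to evaluating without the averaging.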

fmthoker commented 3 months ago

@leexinhao thanks for the reply. After evaluating with wise ft = True, the results are indeed better:

MSRVTT:

split                 txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
msrvtt_1k_test/       42.0    65.7    75.3     61.0        41.9    66.5    75.6     61.33       61.17
msrvtt_1k_test_emb/   42.8    66.8    75.5     61.7        42.8    67.2    75.5     61.83       61.77

LSMDC:

split       txt_r1  txt_r5  txt_r10  txt_r_mean  img_r1  img_r5  img_r10  img_r_mean  r_mean
test/       16.4    32.0    39.3     29.23       18.8    36.1    43.5     32.80       31.02
test_emb/   17.9    33.2    40.8     30.63       18.7    36.9    44.5     33.37       32.00

Can you please confirm which numbers are reported in the paper (test or test_emb)?