facebookresearch / LaViLa

Code release for "Learning Video Representations from Large Language Models"
MIT License

Reproducing zero-shot eval results on EK100-MIR #11

Open melongua opened 1 year ago

melongua commented 1 year ago

Hi, I downloaded the pretrained checkpoint c89337 and used eval_zeroshot.py to evaluate it on EK100_MIR in a zero-shot manner.

I prepared the dataset following the instructions and ran this command:

python eval_zeroshot.py --dataset ek100_mir --root datasets/EK100/video_ht256px/ --clip-length 4 --resume $PATH

The results I got are:

mAP: V->T: 0.334, T->V: 0.251, AVG: 0.292
nDCG: V->T: 0.331, T->V: 0.300, AVG: 0.315

If I increase --clip-length from 4 to 16, as described in the paper, the results are:

mAP: V->T: 0.341, T->V: 0.264, AVG: 0.303
nDCG: V->T: 0.335, T->V: 0.305, AVG: 0.320

Both seem to be much lower than the numbers reported in the paper: mAP: 36.1, nDCG: 34.6

May I ask what might be causing the performance gap? Thanks in advance.
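To put a number on the gap, here is a small sketch that converts the averages above to percentage points and subtracts them from the paper's figures. It assumes the paper's 36.1 and 34.6 are percentages comparable to the AVG columns printed by the script; the dictionary keys and the helper name `gap_in_points` are just illustrative.

```python
# AVG values reported in this thread (fractions in [0, 1]).
measured = {
    "clip_length=4": {"mAP": 0.292, "nDCG": 0.315},
    "clip_length=16": {"mAP": 0.303, "nDCG": 0.320},
}

# Numbers quoted from the paper (assumed to be percentages).
paper = {"mAP": 36.1, "nDCG": 34.6}


def gap_in_points(measured_fraction, paper_percent):
    """Return paper minus measured, in percentage points, to one decimal."""
    return round(paper_percent - 100.0 * measured_fraction, 1)


gaps = {
    setting: {name: gap_in_points(value, paper[name]) for name, value in metrics.items()}
    for setting, metrics in measured.items()
}

for setting, metrics in gaps.items():
    print(setting, metrics)
```

Under that assumption, the 16-frame run is about 5.8 points short on mAP and 2.6 points short on nDCG, so the longer clip closes only a small part of the gap.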

zhaoyue-zephyrus commented 1 year ago

Hi @melongua,

Can you provide some more details about (1) the EK100 data that you are using and (2) some other customized metadata e.g. the relevancy matrix? I believe these might have some effect on the final performance. We've uploaded the ones we used in this doc.

Best, Yue
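To illustrate why the relevancy matrix matters, here is a minimal sketch of mean average precision computed from a similarity matrix and a binary relevancy matrix. This is a generic textbook-style mAP on toy data, not the repository's evaluation code, and it assumes a hard 0/1 relevancy; the function names and matrices are made up for the example.

```python
def average_precision(scores, relevant):
    """Average precision for one query.

    scores:   similarity of the query to each candidate.
    relevant: 0/1 flags of the same length, 1 meaning the candidate is relevant.
    """
    # Rank candidates from most to least similar.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(sim, rel):
    """Mean of per-query average precision over all rows."""
    aps = [average_precision(s, r) for s, r in zip(sim, rel)]
    return sum(aps) / len(aps)


# Toy similarity matrix: 2 queries x 3 candidates.
sim = [[0.9, 0.2, 0.5],
       [0.1, 0.8, 0.7]]

# Two different relevancy matrices for the same similarities.
rel_a = [[1, 0, 0],
         [0, 1, 0]]
rel_b = [[1, 0, 1],
         [0, 0, 1]]

map_a = mean_average_precision(sim, rel_a)
map_b = mean_average_precision(sim, rel_b)
print(map_a, map_b)
```

The same similarity scores give an mAP of 1.0 with `rel_a` and 0.75 with `rel_b`, so a relevancy matrix that differs from the one used for the paper could plausibly account for part of a gap like the one reported here.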