antoyang / FrozenBiLM

[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
https://arxiv.org/abs/2206.08155
Apache License 2.0

Unexpected Zero-shot Results #5

Closed terryyz closed 1 year ago

terryyz commented 2 years ago

Hi,

I tried to evaluate the fine-tuned checkpoints provided in the repo. My environment is correctly configured, and I followed all the steps up to the Zero-shot VideoQA section. As I only have one GPU, I didn't use distributed inference. Here is the command I used to run the evaluation:

python videoqa.py --test --eval --combine_datasets <dataset> --combine_datasets_val <dataset> --save_dir=zs<dataset> --ds_factor_ff=8 --ds_factor_attn=8 --suffix="." --batch_size_val=32 --max_tokens=256 --load=checkpoints/frozenbilm_<dataset>.pth --<dataset>_vocab_path <data_folder>/vocab1000.json

I tried ActivityNet-QA and iVQA and could not reproduce the expected results on either. For instance, here is what I got when testing on ActivityNet-QA:

number of params: 29735424
loading from checkpoints/frozenbilm_activitynet.pth
test:  [  0/250]  eta: 0:07:27  acc: 0.0000 (0.0000)  time: 1.7891  data: 0.3052  max mem: 6485
test:  [100/250]  eta: 0:03:35  acc: 0.0000 (0.0006)  time: 1.4358  data: 0.0020  max mem: 7765
test:  [200/250]  eta: 0:01:11  acc: 0.0000 (0.0005)  time: 1.4355  data: 0.0021  max mem: 7765
test:  [249/250]  eta: 0:00:01  acc: 0.0000 (0.0006)  time: 1.4344  data: 0.0020  max mem: 7765
test: Total time: 0:05:59 (1.4361 s / it)
activitynet
test acc1:  0.06%
test acc10:  0.55%
acc motion:  0.00%
acc spatial:  0.12%
acc temporal:  0.00%
acc yesno:  0.00%
acc color:  0.57%
acc object:  0.00%
acc location:  0.00%
acc number:  0.00%
acc other:  0.00%
acc sub:  0.10%; proportion  25.25%

And results on iVQA:

number of params: 29735424
loading from checkpoints/frozenbilm_ivqa.pth
test:  [ 0/63]  eta: 0:02:40  acc: 0.0000 (0.0000)  time: 2.5405  data: 0.2846  max mem: 6485
test:  [62/63]  eta: 0:00:01  acc: 0.0000 (0.0000)  time: 1.1953  data: 0.0018  max mem: 7766
test: Total time: 0:01:16 (1.2169 s / it)
ivqa
test acc1:  0.00%
test acc10:  0.95%
acc sub:  0.00%; proportion  14.20%
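
For context on the numbers above: acc1 and acc10 are presumably top-1 and top-10 accuracy over the answer vocabulary. The sketch below is a plain-Python illustration of that metric, not the repo's implementation; the function name `topk_accuracy` and the toy scores are hypothetical. If the vocabulary has 1000 answers (an assumption based on the file name vocab1000.json), uniform random guessing would give about 0.1% top-1 and 1% top-10, which is roughly what both logs show.

```python
def topk_accuracy(scores, targets, k):
    """Fraction of examples whose target index is among the k highest-scoring answers.

    Hypothetical helper for illustration only; not taken from the FrozenBiLM code.
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores in this row, best first.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in topk
    return hits / len(targets)


# Toy example: 3 questions, 3 candidate answers each.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
targets = [1, 1, 0]

print(topk_accuracy(scores, targets, 1))  # top-1 accuracy
print(topk_accuracy(scores, targets, 2))  # top-2 accuracy
```

Accuracy at chance level for a model that is expected to do much better usually points to a loading or preprocessing mismatch rather than to the metric itself.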

Do you have any ideas on this issue?

Cheers

antoyang commented 1 year ago

I would try the following things:

terryyz commented 1 year ago

Hi @antoyang, thanks for pointing that out! Please let me know if you can get the expected results without distributed inference. I'm not sure whether this is due to an issue with my environment or a bug in the implementation... Cheers

antoyang commented 1 year ago

I have just verified that the performance of a pretrained checkpoint is the same without distributed evaluation. Some notes: