Unable to reproduce the results reported in your paper

Hi Authors,

First of all, thanks for your great work! It's an amazing contribution to video understanding.

However, when I tried to reproduce the results reported in the paper, I ran into several problems.

I followed the training script in this repo to pretrain / finetune the model on 8 A100 GPUs and then evaluated it on the MSVD dataset. However, the accuracy is very low:

Yes count: 5350
No count: 7802
Accuracy: 0.406782
Average score: 2.606600

Total Score Yes/No distribution:
yes:
0: 0
1: 0
2: 0
3: 2
4: 1432
5: 3916
no:
0: 3401
1: 78
2: 4137
3: 139
4: 36
5: 11

Answer Type Score distribution:
Type, Accuracy, Avg_score
total, 0.406782, 2.606600

acc, score, total
0.406782, 2.606600, 0.406782
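
To rule out a problem in the scoring step itself, here is a minimal sketch that recomputes the summary numbers from the score distribution printed above. The dictionary and variable names are my own (not from the repo's evaluation script), and I am assuming accuracy means the fraction of "yes" judgements and the average score is the count-weighted mean over all answers.

```python
# Score histograms copied from the evaluation output above.
# Variable names are illustrative, not taken from the repo.
yes_scores = {0: 0, 1: 0, 2: 0, 3: 2, 4: 1432, 5: 3916}
no_scores = {0: 3401, 1: 78, 2: 4137, 3: 139, 4: 36, 5: 11}

yes_count = sum(yes_scores.values())
no_count = sum(no_scores.values())
total = yes_count + no_count

# Assumed definition: accuracy = share of answers judged "yes".
accuracy = yes_count / total

# Assumed definition: average score = count-weighted mean of all scores.
score_sum = sum(s * n for s, n in yes_scores.items())
score_sum += sum(s * n for s, n in no_scores.items())
avg_score = score_sum / total

print(f"Yes count: {yes_count}")
print(f"No count: {no_count}")
print(f"Accuracy: {accuracy:.6f}")
print(f"Average score: {avg_score:.6f}")
```

Under those assumptions the recomputed values match the reported ones, so the low accuracy seems to come from the model outputs rather than from the aggregation.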

When I instead evaluate with the provided checkpoint https://huggingface.co/IVGSZ/Flash-VStream-7b, I get the following error:

Traceback (most recent call last):
  File "/sh/Flash-VStream-main/flash_vstream/eval_video/model_msvd_qa_featuresloader.py", line 181, in <module>
    run_inference(args)
  File "/sh/Flash-VStream-main/flash_vstream/eval_video/model_msvd_qa_featuresloader.py", line 150, in run_inference
    output_ids = model.generate(
  File "/sh/anaconda3/envs/python/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sh/anaconda3/envs/python/lib/python3.10/site-packages/transformers/generation/utils.py", line 1588, in generate
    return self.sample(
  File "/sh/anaconda3/envs/python/lib/python3.10/site-packages/transformers/generation/utils.py", line 2678, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
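
In case it helps narrow this down, below is a framework-free sketch of the condition the error message describes: a probability vector containing `nan`, `inf`, or a negative entry. The function name `find_invalid_probs` and the example values are hypothetical; in the actual script I would expect to apply the equivalent `torch.isnan` / `torch.isinf` checks to the logits before sampling.

```python
import math


def find_invalid_probs(probs):
    """Return (index, value) pairs that are NaN, infinite, or negative.

    Hypothetical helper for illustration; it is not part of the repo.
    """
    return [
        (i, p)
        for i, p in enumerate(probs)
        if math.isnan(p) or math.isinf(p) or p < 0
    ]


# Made-up example values, only to show what the check flags.
example = [0.2, float("nan"), 0.5, -0.1, float("inf")]
bad = find_invalid_probs(example)
print([i for i, _ in bad])
```

Knowing which entries are invalid, and at which generation step they first appear, might show whether the problem lies in the checkpoint weights or in the input features.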

Could you please help me with these problems, or point out anything I may have done wrong?

Thanks!
