RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Questions about the provided fine-tuning model parameters #30

Closed (LanXingXuan closed this issue 3 months ago)

LanXingXuan commented 4 months ago

Hello author, when I tested the provided timechat_7b.pth, the metrics I measured were lower than the results reported in the paper. I also fine-tuned TimeChat following the settings in the paper, and the resulting model scored higher than the provided timechat_7b.pth. Is there something wrong with my fine-tuning or testing procedure, or is there an error in the provided fine-tuned model parameters?

Here are the results I got from testing the provided fine-tuned parameters timechat_7b.pth. (Some videos are missing, so the test set is about 20 videos smaller, but I expect this to have little impact on the results.)

[val]
gt video nums 396; pred video nums 396
gt video nums 396; pred video nums 396
evaluate data samples: 396
gt file:

paragraph video captioning
    Para_CIDER 2.5
    Para_METEOR 6.7

dense video captioning
    CIDER 2.4
    METEOR 0.9
    Precision@0.3 26.8
    Recall@0.3 26.7
    Precision@0.5 8.9
    Recall@0.5 9.9
    Precision@0.7 2.1
    Recall@0.7 2.9
    Precision@0.9 0.4
    Recall@0.9 0.6
    Precision_Mean 9.5
    Recall_Mean 10.0
    F1_Score 8.7
    SODA_c_2 0.9
    n_preds 7.6
    SODA_c_1 -100.0

The following are the results of the fine-tuned checkpoint_2.pth that I reproduced myself:

[val]
gt video nums 396; pred video nums 396
gt video nums 396; pred video nums 396
evaluate data samples: 396
gt file:

paragraph video captioning
    Para_CIDER 2.1
    Para_METEOR 8.1

dense video captioning
    CIDER 2.8
    METEOR 1.0
    Precision@0.3 31.1
    Recall@0.3 43.5
    Precision@0.5 11.0
    Recall@0.5 17.9
    Precision@0.7 3.4
    Recall@0.7 6.3
    Precision@0.9 0.4
    Recall@0.9 0.8
    Precision_Mean 11.5
    Recall_Mean 17.1
    F1_Score 12.4
    SODA_c_2 1.2
    n_preds 11.0
    SODA_c_1 -100.0
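To make the two logs easier to compare, here is a minimal Python sketch that computes the per-metric difference between the two runs. The numbers are copied from the logs above; the helper names (`mean_over_thresholds`, `compare`) are made up for illustration and are not part of the TimeChat evaluation code. The sketch also checks an observation from the numbers themselves: Precision_Mean and Recall_Mean match the plain average over the four thresholds, up to rounding. Whether the evaluation script computes them exactly this way is an assumption.

```python
# Metric values copied verbatim from the two evaluation logs (396 videos each).
THRESHOLDS = (0.3, 0.5, 0.7, 0.9)

provided = {  # timechat_7b.pth
    "Precision@0.3": 26.8, "Recall@0.3": 26.7,
    "Precision@0.5": 8.9, "Recall@0.5": 9.9,
    "Precision@0.7": 2.1, "Recall@0.7": 2.9,
    "Precision@0.9": 0.4, "Recall@0.9": 0.6,
    "Precision_Mean": 9.5, "Recall_Mean": 10.0,
    "F1_Score": 8.7, "CIDER": 2.4, "METEOR": 0.9, "SODA_c_2": 0.9,
}

reproduced = {  # checkpoint_2.pth
    "Precision@0.3": 31.1, "Recall@0.3": 43.5,
    "Precision@0.5": 11.0, "Recall@0.5": 17.9,
    "Precision@0.7": 3.4, "Recall@0.7": 6.3,
    "Precision@0.9": 0.4, "Recall@0.9": 0.8,
    "Precision_Mean": 11.5, "Recall_Mean": 17.1,
    "F1_Score": 12.4, "CIDER": 2.8, "METEOR": 1.0, "SODA_c_2": 1.2,
}


def mean_over_thresholds(metrics, prefix):
    """Average a per-threshold metric (e.g. 'Precision') over THRESHOLDS."""
    return sum(metrics[f"{prefix}@{t}"] for t in THRESHOLDS) / len(THRESHOLDS)


def compare(a, b):
    """Return b - a for every metric present in both runs, rounded to 0.1."""
    return {k: round(b[k] - a[k], 1) for k in a if k in b}


for name, run in (("provided", provided), ("reproduced", reproduced)):
    for prefix in ("Precision", "Recall"):
        avg = mean_over_thresholds(run, prefix)
        print(f"{name:10s} {prefix}: avg={avg:.2f} reported={run[prefix + '_Mean']}")

for metric, delta in compare(provided, reproduced).items():
    print(f"{metric:15s} {delta:+.1f}")
```

The largest gap is in recall at the lower thresholds, alongside the higher n_preds (11.0 versus 7.6) reported for the reproduced checkpoint.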

RenShuhuai-Andy commented 4 months ago

Hi, thanks for your interest.

The zero-shot performance of timechat_7b.pth on YouCook2 (shown in the image below) should be higher than the results in our paper: [image]

If not, please check whether you correctly converted the videos to "youcook2_6fps_224" (see https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#compressing-videos).

Which dataset did you use for fine-tuning? TimeIT? If so, I think the results are comparable to the results in our paper.
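As a rough illustration of what such a conversion involves, here is a minimal Python sketch that builds an ffmpeg command for one video. It assumes that "6fps" means resampling to 6 frames per second and that "224" refers to the output height. Both are guesses from the folder name, and the function name `build_ffmpeg_cmd` is hypothetical. The script referenced in DATA.md is authoritative and may use different options.

```python
import shlex
from pathlib import Path


def build_ffmpeg_cmd(src, dst_dir, fps=6, height=224):
    """Build an ffmpeg argv list that resamples `src` to `fps` frames per
    second and scales it to `height` pixels tall.

    Assumption: 6 fps / height 224 are inferred from the folder name
    "youcook2_6fps_224"; they are not taken from the repository's script.
    """
    src = Path(src)
    dst = Path(dst_dir) / src.name
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", str(src),       # input video
        # fps filter resamples frames; scale=-2:H keeps the aspect ratio
        # and rounds the width to an even number.
        "-vf", f"fps={fps},scale=-2:{height}",
        str(dst),
    ]


cmd = build_ffmpeg_cmd("raw_videos/example.mp4", "youcook2_6fps_224")
print(shlex.join(cmd))
```

To actually run the conversion, the returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. The input path above is a placeholder.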