RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Seeking Clarification about Fine-tuning Datasets #12

Closed · ShramanPramanick closed this issue 6 months ago

ShramanPramanick commented 6 months ago

Hi Authors,

Congratulations on the great work and the CVPR 2024 acceptance. Could you clarify the number of samples used for fine-tuning?

The paper mentions that TimeChat is fine-tuned on TimeIT for three epochs. However, the training config file `train_configs/stage2_finetune_time104k_valley72k.yaml` lists both the TimeIT and Valley datasets for fine-tuning, which together amount to ~176K samples. Could you explain the reason for using the Valley dataset?
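
For anyone double-checking this themselves, here is a minimal sketch of how to list the dataset entries in that config. The top-level `datasets` key follows the LAVIS-style convention used by similar codebases and is an assumption here, not verified against TimeChat's exact schema:

```python
# Sketch: inspect which datasets a YAML training config lists.
# The 'datasets' key is assumed (LAVIS-style convention), not guaranteed.
import yaml  # requires PyYAML

with open("train_configs/stage2_finetune_time104k_valley72k.yaml") as f:
    cfg = yaml.safe_load(f)

# Print the dataset entry names; we would expect TimeIT and Valley here.
print(list(cfg.get("datasets", {}).keys()))
```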

Thanks, Shraman

RenShuhuai-Andy commented 6 months ago

Hi, thanks for your interest.

The results reported in the paper were obtained using the TimeIT + Valley datasets (we will state this more clearly in an update to the paper).

TimeIT is a timestamp-related instruction-tuning dataset, while Valley is a general video instruction-tuning dataset. Joint training on both gives the model dual proficiency: time-aware comprehension and general video comprehension.
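
As a rough illustration of such joint training, here is a self-contained sketch (not TimeChat's actual code; the toy dataset class is a stand-in, with sizes matching TimeIT's ~104K and Valley's ~72K samples). Concatenating the two datasets means each shuffled epoch interleaves time-aware and general samples:

```python
# Minimal sketch of joint instruction tuning over two datasets:
# concatenate them so every shuffled epoch mixes both data sources.
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class ToyInstructionDataset(Dataset):
    """Stand-in for an instruction-tuning dataset (e.g., TimeIT or Valley)."""
    def __init__(self, name: str, size: int):
        self.name, self.size = name, size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # A real dataset would return video features + instruction text.
        return {"source": self.name, "index": idx}

timeit = ToyInstructionDataset("TimeIT", 104_000)  # timestamp-related samples
valley = ToyInstructionDataset("Valley", 72_000)   # general video samples

joint = ConcatDataset([timeit, valley])            # ~176K samples in total
loader = DataLoader(joint, batch_size=8, shuffle=True)

batch = next(iter(loader))
print(batch["source"])  # a shuffled mix of 'TimeIT' and 'Valley' items
```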

ShramanPramanick commented 6 months ago

Hi @RenShuhuai-Andy, thanks for the clarification. Please update the paper, as it currently does not mention that Valley is used for training. Thanks.