RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Question about fine-tuning #25

Closed zhengxingmao closed 3 months ago

zhengxingmao commented 4 months ago

Firstly, I would like to express my sincere appreciation for your work in this field; it is a project of great significance. Here is my question: after training on my own data, the model seems to only remember the new information and forgets its old knowledge. What could be the reason for this? Also, I tried enabling the freezing options in the train_config, but after training the saved model becomes significantly smaller, and trying to load the saved weights directly results in an error. Here is the error message: [error screenshot attached]

RenShuhuai-Andy commented 4 months ago

Hi, thanks for your interest.

Please provide more information, for example:

  1. How many training samples do you have? What kind of tuning method do you use? What hyper-parameters (e.g., lr, bsz, number of epochs) do you use?

  2. What does "enabling the freezing option in the train_config" mean? Did you set freeze_qformer, frozen_llama_proj, and frozen_video_Qformer to True? In your log, the trainable parameters are 0.0% ... If so, the checkpoint won't contain the video_frame_position_embedding module (see the sketch below).
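
For intuition, here is a minimal PyTorch sketch (not TimeChat code; trainable_percentage is a hypothetical helper) of how the 0.0% figure in a training log typically arises, and why a checkpoint that only saves trainable weights ends up nearly empty:

```python
import torch

def trainable_percentage(model: torch.nn.Module) -> float:
    """Percentage of parameters that will actually receive gradients."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return 100.0 * trainable / max(total, 1)

# With freeze_qformer, frozen_llama_proj, and frozen_video_Qformer all True,
# essentially every sub-module has requires_grad=False, so this reports ~0.0%,
# and a checkpoint that stores only the trainable parameters comes out almost
# empty -- which is why the saved model shrinks and later fails to load.
```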

zhengxingmao commented 4 months ago

Yes, I have set all these parameter options to True. I am using the provided train.py script for training. My custom dataset is quite small, consisting of approximately 10 records. I only intend to fine-tune an existing model. What are the correct steps to follow?

zhengxingmao commented 4 months ago

Here is my train config file. My custom datasets are registered under the keys "time_instruct" and "valley72k_instruct". stage2_finetune_time104k_valley72k.zip

RenShuhuai-Andy commented 4 months ago

Because there are only 10 training samples, it is very easy to overfit. In my opinion, you should use at least a few hundred training samples.

Since the number of training samples is small, I think you can set frozen_llama_proj: False while keeping freeze_qformer and frozen_video_Qformer set to True (see the config sketch below).
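
In the train_config YAML, that suggestion would look roughly like the snippet below; the key names come from this thread, while the surrounding structure is assumed to mirror your stage2_finetune_time104k_valley72k config:

```yaml
model:
  freeze_qformer: True         # image Q-Former stays frozen
  frozen_video_Qformer: True   # video Q-Former stays frozen
  frozen_llama_proj: False     # only the LLaMA projection layer is trained
```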

After that, when loading the ckpt, I believe you should set ckpt_path to the pre-trained TimeChat ckpt and ckpt_path_2 (https://github.com/RenShuhuai-Andy/TimeChat/blob/master/timechat/models/timechat.py#L594) to your fine-tuned ckpt.
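
A rough sketch of that two-stage loading order, assuming MiniGPT-4-style checkpoints whose weights live under a "model" key (the helper name and dict key are assumptions, not the exact TimeChat API):

```python
import torch

def load_pretrained_then_finetuned(model, ckpt_path, ckpt_path_2):
    # 1) Restore the full pre-trained TimeChat weights.
    base = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(base["model"], strict=False)

    # 2) Overlay the fine-tuned checkpoint; it only holds the modules that
    #    stayed trainable (here, llama_proj), so strict=False skips the rest.
    delta = torch.load(ckpt_path_2, map_location="cpu")
    model.load_state_dict(delta["model"], strict=False)
    return model
```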

zhengxingmao commented 4 months ago

Thank you for your advice, I now understand the issue. I'll increase the sample size later. Additionally, I'd like to inquire about how to make the model support Chinese question-answering, with responses also in Chinese. Do I simply switch to a Chinese language model? Any guidance you can offer would be appreciated.

zhengxingmao commented 4 months ago

Another topic for discussion is how to support parallel video inference in a model. Any suggestions? Is it feasible to use the torch.cat approach to concatenate videos into one and then process them?

RenShuhuai-Andy commented 4 months ago

> Additionally, I'd like to inquire about how to make the model support Chinese question-answering, with responses also in Chinese. Do I simply switch to a Chinese language model? Any guidance you can offer would be appreciated.

Unfortunately, our model currently only supports English. It seems that it can understand Chinese questions but cannot generate responses in Chinese (see the attached screenshot). I think instruction-tuning on Chinese datasets would be required to support Chinese.

> Another topic for discussion is how to support parallel video inference in a model. Any suggestions? Is it feasible to use the torch.cat approach to concatenate videos into one and then process them?

Our evaluation code supports batch-level inference; you can specify the batch size in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/eval.sh#L45C107-L45C122. The related code is in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/timechat/conversation/conversation_video_batch.py.
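
For intuition on why batching differs from torch.cat: concatenating along the frame axis would fuse several clips into one long timeline, whereas batch inference stacks them along a new batch dimension. A minimal sketch, assuming every video is already sampled to the same number of frames (96, as in TimeChat):

```python
import torch

# Four videos, each sampled to 96 frames of 3x224x224.
videos = [torch.randn(96, 3, 224, 224) for _ in range(4)]

# torch.cat(videos, dim=0) -> (384, 3, 224, 224): one merged timeline,
# so the model would treat the four clips as a single long video.
# torch.stack keeps them separate: (4, 96, 3, 224, 224).
batch = torch.stack(videos, dim=0)
print(batch.shape)  # torch.Size([4, 96, 3, 224, 224])
```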