RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Question about the tokenizer #19

Closed gyxxyg closed 5 months ago

gyxxyg commented 5 months ago

Hi, thank you for sharing the code.

I have some questions about the tokenizers used for the Q-former input text, such as `This frame is sampled at 2s.`

I noticed that in video_instruct_dataset.py, the text is tokenized using LlamaTokenizer, but in models/blip2.py and models/Qformer.py, BertTokenizer is used instead. I am curious why this implementation was chosen.

RenShuhuai-Andy commented 5 months ago

Hi, thanks for your interest.

In video_instruct_dataset.py, we use the LlamaTokenizer to tokenize the texts for the LLM, including the instruction, optional ASR transcript, and target output.

In blip2.py/Qformer.py, since we adopt the pre-trained Q-former from InstructBLIP, we use the BertTokenizer to tokenize the textual timestamps for the Q-former, which aligns with the practice of InstructBLIP.
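To make the intended split concrete, here is a toy, self-contained sketch (not the repo's actual code, and with made-up token IDs) of why IDs produced by one tokenizer cannot be consumed by a model built around another tokenizer's vocabulary:

```python
# Toy illustration: each tokenizer maps the same string into IDs in its
# *own* vocabulary, so the IDs are only meaningful to the matching model.

def make_toy_tokenizer(vocab):
    """Return a whitespace tokenizer over a fixed vocab; unknowns map to 0."""
    def tokenize(text):
        return [vocab.get(tok, 0) for tok in text.lower().split()]
    return tokenize

# Two stand-ins with different (hypothetical) vocabularies.
llama_like = make_toy_tokenizer({"this": 910, "frame": 4204, "is": 338,
                                 "sampled": 4559, "at": 472, "2s.": 259})
bert_like = make_toy_tokenizer({"this": 2023, "frame": 4853, "is": 2003,
                                "sampled": 19940, "at": 2012, "2s.": 100})

timestamp = "This frame is sampled at 2s."
print(llama_like(timestamp))  # IDs valid only in the Llama-style vocab
print(bert_like(timestamp))   # IDs valid only in the BERT-style vocab
# Feeding the first list into a BERT-vocab embedding table would look up
# unrelated rows, which is the mismatch discussed in this thread.
```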

gyxxyg commented 5 months ago

Thank you for your response!

I have noticed that both the LLM inputs and outputs use the LlamaTokenizer. However, I have also found that the timestamps are tokenized with the LlamaTokenizer in video_instruct_dataset.py as well, and are then fed into the Q-former. I am a bit confused about this and would appreciate your assistance.

mengqiDyangge commented 5 months ago

> Thank you for your response!
>
> I have noticed that both the LLM inputs and outputs use the LlamaTokenizer. However, I have also found that the timestamps are tokenized with the LlamaTokenizer in video_instruct_dataset.py as well, and are then fed into the Q-former. I am a bit confused about this and would appreciate your assistance.

I also noticed this problem. The code uses the LlamaTokenizer to tokenize the timestamps that are used as the input to the Q-former.

RenShuhuai-Andy commented 5 months ago

@gyxxyg @mengqiDyangge Hi, you are right; we didn't implement this part very carefully... I'm not sure how much using the LlamaTokenizer for the Q-former affects the results. Replacing it with the BertTokenizer may achieve better performance.
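The suggested fix amounts to routing each text through the tokenizer its consumer was trained with. A minimal sketch, with assumed function and key names (the stand-in tokenizers below are placeholders for real LlamaTokenizer and BertTokenizer instances from HuggingFace transformers):

```python
# Sketch of the fix: timestamps go through a BERT-vocab tokenizer for the
# Q-former, while the LLM tokenizer still handles instructions and targets.
# All names here are hypothetical, for illustration only.

def prepare_inputs(instruction, timestamps, llm_tokenize, qformer_tokenize):
    """Tokenize each text with the tokenizer its consumer expects."""
    return {
        "llm_input_ids": llm_tokenize(instruction),
        # Timestamp strings such as "This frame is sampled at 2s." go to the
        # Q-former, which inherits InstructBLIP's BERT vocabulary.
        "qformer_input_ids": [qformer_tokenize(t) for t in timestamps],
    }

# Stand-in tokenizers that just map words into each model's vocab range.
llm_tokenize = lambda s: [hash(w) % 32000 for w in s.split()]
qformer_tokenize = lambda s: [hash(w) % 30522 for w in s.split()]

batch = prepare_inputs(
    "Describe the video.",
    ["This frame is sampled at 2s.", "This frame is sampled at 4s."],
    llm_tokenize, qformer_tokenize,
)
print(len(batch["qformer_input_ids"]))  # one ID sequence per sampled frame
```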

gyxxyg commented 5 months ago

Thank you very much, I will try it.