Question about the Text Input to the LLM

ShramanPramanick commented 2 months ago

Thanks for your amazing work!

I have a question about the textual input to the LLM. The input_ids from the dataset contain a long message listing the timestamps at which the frames were sampled. For example:

The video contains 96 frames sampled at 0.0, 0.2, 0.5, 0.7, 0.8, 1.0, 1.3, 1.5, 1.7, 1.8, 2.0, 2.3, 2.7, 2.8, 3.2, 3.7, 4.0, 4.2, 4.5, 4.7, 4.8, 5.0, 5.3, 5.7, 5.8, 6.0, 6.2, 6.5, 6.7, 6.8, 7.0, 7.3, 7.5, 7.7, 8.0, 8.2, 8.8, 9.0, 9.2, 9.3, 9.5, 9.8, 10.0, 10.2, 10.5, 11.0, 11.3, 11.5, 12.0, 12.2, 12.3, 12.5, 12.7, 12.8, 13.0, 13.2, 13.5, 13.7, 13.8, 14.2, 14.3, 14.5, 14.8, 15.0, 15.2, 15.3, 15.5, 15.7, 15.8, 16.0, 16.2, 16.3, 16.7, 16.8, 17.0, 17.2, 17.3, 17.5, 17.7, 18.0, 18.2, 18.5, 18.7, 18.8, 19.0, 19.2, 19.3, 19.5, 19.8, 20.2, 20.3, 20.7, 20.8, 21.2, 21.5, 21.7 seconds.

However, the frame embeddings are replaced by the Video QFormer features (lines 417-440 in timechat.py), which no longer represent individual frames, since the Video QFormer incorporates temporal correspondence across frame embeddings. If so, isn't the above text input confusing for the LLM, as the LLM might interpret the visual features as standalone frame embeddings?

Moreover, since the frames are already timestamped by the Image QFormer, why is this long text stating which frame comes from which second needed at all?
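For reference, here is a minimal sketch of how a prompt like the one quoted above could be assembled from a list of sampled timestamps. The function name `build_frame_time_prompt` and its argument are hypothetical, chosen for illustration only; this is not taken from the TimeChat code, which may build the string differently.

```python
from typing import Sequence


def build_frame_time_prompt(timestamps_sec: Sequence[float]) -> str:
    """Format sampled frame timestamps into a single prompt sentence.

    Hypothetical helper: the name and signature are assumptions,
    not part of the TimeChat repository.
    """
    # One decimal place, matching the example prompt in this issue.
    times = ", ".join(f"{t:.1f}" for t in timestamps_sec)
    return (
        f"The video contains {len(timestamps_sec)} frames "
        f"sampled at {times} seconds."
    )


if __name__ == "__main__":
    print(build_frame_time_prompt([0.0, 0.2, 0.5, 0.7]))
```

With 96 frames the sentence carries 96 numbers, which is why the text portion of the input grows so long.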

RenShuhuai-Andy commented 2 months ago

Hi, thanks for your interest.

This is common practice in Video-LLaMA and VideoChat. We simply kept it, even though we already use the time-aware frame encoder.

I don't like this long textual input either, but I haven't done a careful ablation study on it (my guess is that explicit textual time information is still beneficial for the LLM). We plan to run that ablation in our next version.

ShramanPramanick commented 1 month ago

Thanks Shuhuai, I appreciate your opinion.

RenShuhuai-Andy / TimeChat