Closed: EIFY closed this issue 1 year ago.
Hello. Thanks for pointing this out. Regarding the pre-training method you mention, I have had similar thoughts before. My earlier idea was to learn autoregressive retrieval on a larger dataset and use it to generate sequences, thereby expanding the existing downstream sequence samples. If this method is used for pre-training, its feasibility still needs to be verified experimentally. We may run related experiments in the future.
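Very roughly, the idea was something like the sketch below: at each step, retrieve the best-matching token from the larger dataset and append it to a downstream sequence. The function names, the greedy decoding, and the dot-product scoring are only placeholders for illustration, not an actual implementation.

```python
# Purely illustrative sketch (PyTorch). `encode_prefix`, the greedy argmax step,
# and the dot-product scoring are assumptions, not code from this repo.
import torch

@torch.no_grad()
def autoregressive_retrieve(encode_prefix,                 # callable: (1, t, D) -> (1, D) prefix embedding
                            candidate_pool: torch.Tensor,  # (N, D) tokens from the larger dataset
                            prefix: torch.Tensor,          # (1, t0, D) seed tokens from a downstream sample
                            steps: int = 16) -> torch.Tensor:
    seq = prefix
    for _ in range(steps):
        q = encode_prefix(seq)                       # summarize the sequence built so far
        scores = q @ candidate_pool.t()              # (1, N) similarity to every candidate token
        nxt = candidate_pool[scores.argmax(dim=-1)]  # retrieve the best-matching token
        seq = torch.cat([seq, nxt.unsqueeze(1)], dim=1)
    return seq                                       # an expanded sequence sample
```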
Hey @cg1177, just curious:
Hi! I just went through your preprint, and here are my two quick reactions, if you don't mind:
Typo in the Figure 3 caption of the preprint
It should be “Unseen tokens”.
Possibility of token-retrieval pretraining
VideoLLM, especially its use of a linear projector to map video tokens into tokens for the LLM, reminds me of https://github.com/kohjingyu/fromage. However, there is no equivalent of the image-text retrieval pretraining task: i.e., given a description of the video, train the LLM to retrieve the correct video tokens, in the correct order, from among all the video tokens in the same batch. Could this be a useful pretraining task here?
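To make the idea concrete, here is a minimal PyTorch sketch of what I have in mind; the function name, the shapes, and the InfoNCE-style loss are my own placeholders, not anything taken from the VideoLLM or FROMAGe code:

```python
# Minimal sketch (PyTorch): in-batch, order-aware video-token retrieval.
# `query` would come from the LLM reading the description (one query vector per
# video-token position); `video_tokens` would come from the linear projector.
import torch
import torch.nn.functional as F

def token_retrieval_loss(query: torch.Tensor,         # (B, T, D) text-side queries
                         video_tokens: torch.Tensor,  # (B, T, D) projected video tokens
                         temperature: float = 0.07) -> torch.Tensor:
    B, T, D = video_tokens.shape
    q = F.normalize(query, dim=-1).reshape(B * T, D)
    k = F.normalize(video_tokens, dim=-1).reshape(B * T, D)
    # Each query scores every video token in the batch (B*T candidates).
    logits = q @ k.t() / temperature
    # The correct candidate for query (b, t) is video token (b, t), i.e. the
    # diagonal of the (B*T, B*T) score matrix.
    targets = torch.arange(B * T, device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage:
# loss = token_retrieval_loss(torch.randn(4, 8, 256), torch.randn(4, 8, 256))
```

The in-batch negatives come for free, as in FROMAGe's image-text retrieval objective, and the position-wise targets are what would enforce the "correct order" part.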