Closed — mutonix closed this issue 2 months ago
Sorry for the confusion, but this code is simply another way of implementing MSE. Compared to nn.MSELoss, this implementation yields values that are several orders of magnitude larger, putting them on the same scale as the decode loss. As a result, there is no need to tune a separate weight for this loss term.
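For what it's worth, one common way an MSE variant ends up orders of magnitude larger than nn.MSELoss is summing the squared error over the feature dimension instead of averaging over all elements. This is a hypothetical sketch (numpy stand-ins for the PyTorch tensors, and the sum-over-features variant is an assumption, not necessarily the exact code in the repo):

```python
import numpy as np

def mse_mean(pred, target):
    # Mirrors nn.MSELoss(reduction="mean"): average over ALL elements.
    return np.mean((pred - target) ** 2)

def mse_sum_per_token(pred, target):
    # Hypothetical variant: sum squared error over the feature dim,
    # then average over batch/tokens. Larger by a factor of feature_dim.
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

rng = np.random.default_rng(0)
pred = rng.standard_normal((4, 32, 768))    # (batch, tokens, feature_dim)
target = rng.standard_normal((4, 32, 768))

a = mse_mean(pred, target)
b = mse_sum_per_token(pred, target)
print(b / a)  # ratio is exactly the feature dim, 768
```

With a 768-dim feature space the variant is ~768x larger, which is the kind of rescaling that would bring it in line with a decode loss without a tuned weight.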
Thanks for your reply. I have another question about the text tokens concatenated with the query tokens in the QFormer: does this additional text input actually improve the model? Is there any evidence of the benefit?
This operation is derived from InstructBLIP, and the ablation results in that paper show a clear benefit. When the question is fed into the QFormer together with the query tokens, the QFormer can extract video features relevant to the question, which significantly improves the model's instruction-following ability. Text input to the QFormer can also serve other purposes; for example, TimeChat uses this operation to encode timestamps.
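The operation itself is just a concatenation along the token dimension before the QFormer runs, with only the query positions kept afterwards as the visual features. A minimal shape-level sketch (numpy stand-ins; the shapes 32 queries / 768 dims are illustrative assumptions following the InstructBLIP-style setup, not values taken from this repo):

```python
import numpy as np

# Illustrative shapes for an InstructBLIP-style QFormer input.
batch, n_query, n_text, dim = 2, 32, 12, 768

query_tokens = np.zeros((batch, n_query, dim))  # learnable queries (stand-in)
text_embeds = np.ones((batch, n_text, dim))     # embedded question tokens (stand-in)

# Concatenate queries and text so self-attention inside the QFormer
# lets the queries condition on the instruction.
qformer_input = np.concatenate([query_tokens, text_embeds], axis=1)
print(qformer_input.shape)  # (2, 44, 768)

# After the QFormer, only the query positions are kept as video features
# that get passed on to the LLM.
video_features = qformer_input[:, :n_query, :]
print(video_features.shape)  # (2, 32, 768)
```

The text tokens thus steer the cross-attention that the queries perform over the frozen visual encoder's output, without changing how many tokens reach the LLM.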
Is the MVM loss the MSE loss or the cosine similarity loss? In the paper, the MVM loss is described as MSE, but the loss in the code is different:
https://github.com/TencentARC/ST-LLM/blob/e566966f7ea4a07a8e3e5b64dab25de74d077010/stllm/models/st_llm.py#L91