TencentARC / ST-LLM

[ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"
Apache License 2.0

MVM loss #8

Closed mutonix closed 2 months ago

mutonix commented 2 months ago

Is the MVM loss the MSE loss or the cosine-similarity loss? The paper describes it as MSE, but the loss computed in the code looks different.

https://github.com/TencentARC/ST-LLM/blob/e566966f7ea4a07a8e3e5b64dab25de74d077010/stllm/models/st_llm.py#L91

farewellthree commented 2 months ago

Sorry for the confusion, but this code is one way of implementing MSE. Compared to nn.MSELoss, this implementation yields values that are several orders of magnitude larger, similar in scale to the decoding loss. Therefore, it eliminates the need to tune a separate weight for the MVM loss.
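To illustrate the scaling point (a minimal sketch, not the repo's actual code; the shapes and the sum-over-feature-dim reduction are assumptions), summing the squared error over the hidden dimension instead of averaging over every element inflates the loss by roughly the hidden size, which can put it on the same scale as the decoding loss:

```python
import numpy as np

# Hypothetical feature tensors: (batch, num_tokens, hidden_dim)
rng = np.random.default_rng(0)
pred = rng.standard_normal((4, 32, 768))
target = rng.standard_normal((4, 32, 768))

sq_err = (pred - target) ** 2

# nn.MSELoss default: mean over every element
mse_mean = sq_err.mean()

# Sum-style reduction: sum over the feature dim, then average.
# This is larger by exactly the hidden size (768 here).
mse_sum_dim = sq_err.sum(axis=-1).mean()

print(mse_sum_dim / mse_mean)  # ratio equals the hidden dim, 768
```

With the loss already on the same order of magnitude as the decoding loss, a weight of 1.0 works without tuning.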

mutonix commented 2 months ago

Thanks for your reply. I have another question about the text tokens concatenated to the query tokens in the QFormer: does this additional text input improve the model? Is there any evidence of the benefit?

farewellthree commented 2 months ago

This operation is derived from InstructBLIP, and its ablation results in that paper are quite significant. When the question is fed into the QFormer together with the queries, the QFormer can extract video features relevant to the question, which significantly enhances the model's ability to follow instructions. Text input to the QFormer can also serve other purposes; for example, TimeChat uses it to encode timestamps.
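The shape bookkeeping of this conditioning can be sketched as follows (a simplified illustration, not ST-LLM's actual code; the token counts and hidden size are assumptions). The question embeddings are concatenated to the learnable query tokens before the QFormer's self-attention, and only the query positions are kept afterwards:

```python
import numpy as np

# Hypothetical shapes: 32 learnable queries, a 9-token question, hidden size 768
num_query, num_text, dim = 32, 9, 768
query_tokens = np.zeros((1, num_query, dim))  # learned query embeddings
text_embeds = np.zeros((1, num_text, dim))    # embedded question tokens

# InstructBLIP-style conditioning: concatenate the question tokens to the
# queries so that, inside the QFormer, self-attention lets the queries see
# the instruction while cross-attention reads the video features.
qformer_input = np.concatenate([query_tokens, text_embeds], axis=1)
print(qformer_input.shape)  # (1, 41, 768)

# After the QFormer, only the query positions are projected and passed on
# to the LLM; the text positions are discarded.
video_tokens = qformer_input[:, :num_query]
print(video_tokens.shape)  # (1, 32, 768)
```

The key design point is that the extracted visual tokens become a function of the question, rather than a fixed summary of the video.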

mutonix commented 2 months ago

Many thanks for your patient reply. This model is very powerful. We are now trying the ST-LLM on our own dataset Vript.