LeapLabTHU / MLLA

Official repository of MLLA

How's the performance on language tasks since the forget gate has been replaced? #7

Closed · kyrie-23 closed this 3 weeks ago

kyrie-23 commented 3 weeks ago

Thanks for the amazing work! You've achieved SOTA results on vision tasks with the positional encoding design. I'm just curious about performance on language tasks now that the forget gate has been replaced: will the positional encoding maintain comparable performance while allowing parallelizable computation?

Thanks again for the amazing work and shared checkpoints!

tian-qing001 commented 3 weeks ago

Hi @kyrie-23, thanks for your attention to our work. As discussed in our paper, the forget gate is ideally suited for language data, which naturally needs auto-regressive training and recurrent inference. Therefore, we think it may not be necessary to replace the forget gate with positional encodings in language models. Additionally, our work focuses on demystifying Mamba in vision, and we leave language tasks for future work.
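
For readers less familiar with the tradeoff described here, a minimal PyTorch sketch of the contrast (illustrative only, not code from this repository; the function names, shapes, and the omission of normalization are assumptions) between a forget-gated linear-attention recurrence and the ungated, fully parallel form that fits the non-causal vision setting:

```python
import torch

def gated_linear_attention_recurrent(q, k, v, forget):
    """Forget-gated linear attention as a step-by-step recurrence.

    q, k, v, forget: (B, T, d) tensors; forget holds per-step decay gates in (0, 1).
    The hidden state must be updated sequentially over T, which matches
    auto-regressive language modeling but prevents parallelism over the sequence.
    """
    B, T, d = q.shape
    state = q.new_zeros(B, d, d)          # running sum of gated k v^T outer products
    outputs = []
    for t in range(T):
        state = forget[:, t].unsqueeze(-1) * state \
            + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)
        outputs.append(torch.einsum('bd,bde->be', q[:, t], state))
    return torch.stack(outputs, dim=1)    # (B, T, d)


def linear_attention_parallel(q, k, v):
    """Linear attention with the forget gate removed (normalization omitted).

    With no per-step gate, the whole sequence reduces to two matrix products,
    fully parallel over T; ordering information has to come from positional
    encodings applied to q/k elsewhere in the block.
    """
    kv = torch.einsum('btd,bte->bde', k, v)     # (B, d, d)
    return torch.einsum('btd,bde->bte', q, kv)  # (B, T, d)
```

The recurrent form processes one token at a time, while the ungated form collapses to two matmuls per sequence; the price is that positional information must be injected by encodings rather than by the decaying gate.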

kyrie-23 commented 3 weeks ago

Thanks for your reply. The forget gate is suited to language data but is slower than fully parallel computation, which is a key reason you replaced it with positional encoding. Since there are plenty of effective positional encodings in the Transformer family, would it be possible to use one of them in MLLA for language tasks, or is the forget gate the best solution for the causal setting?
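
To make the question concrete, here is a rough sketch (again only an assumption of what such a variant might look like, not MLLA's code; the RoPE helper and names are made up) of causal linear attention without a forget gate, where a rotary positional encoding on q/k carries position and the gated recurrence is replaced by a cumulative sum, keeping training parallel over the sequence:

```python
import torch

def rope(x):
    """Minimal rotary positional encoding (RoPE) over the last dim (d must be even)."""
    B, T, d = x.shape
    half = d // 2
    pos = torch.arange(T, device=x.device, dtype=x.dtype).unsqueeze(-1)        # (T, 1)
    inv_freq = 1.0 / (10000 ** (torch.arange(half, device=x.device, dtype=x.dtype) / half))
    angles = pos * inv_freq                                                    # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


def causal_linear_attention_rope(q, k, v):
    """Causal linear attention with RoPE instead of a forget gate (normalization omitted).

    The running sum of k_t v_t^T outer products is a plain cumulative sum, so
    training stays parallel over T; the (B, T, d, d) tensor is materialized
    here only for clarity, whereas practical kernels use chunked or recurrent forms.
    """
    q, k = rope(q), rope(k)
    kv = torch.cumsum(torch.einsum('btd,bte->btde', k, v), dim=1)  # (B, T, d, d)
    return torch.einsum('btd,btde->bte', q, kv)                    # (B, T, d)
```

Whether such an encoding matches the forget gate's modeling quality on language data is exactly the open question raised here.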

tian-qing001 commented 3 weeks ago

This is an interesting question. Currently, our work mainly focuses on demystifying Mamba in vision, and we will explore results on language tasks in the future.