MambaMixer / M2


What is the ultimate goal of flipping? #2

Open dsdanielpark opened 5 months ago

dsdanielpark commented 5 months ago

Congratulations on the cool and quick work and outcomes.

Is the flip intended to capture more features, or is it meant as a form of augmentation?

I'm curious if other attempts were made. Can you share rough experiments or ideation details that didn't make it into the official paper?

I was impressed by how quickly and intuitively Mamba Mixer applies the Mamba architecture while overcoming some of Mamba's drawbacks. I think it will be a very valuable starting point. I'll wait for the official repo to be updated.

Thanks in advance.

ABehrouz commented 4 months ago

Hello,

Thank you for your kind words, and I am very sorry for the delay in my response.

The main goal of flipping is to make the model non-causal. Without flipping, each token only has access to information from previous tokens. That restriction is not always appropriate: channels, for example, have no causal order, so scanning in both directions (bi-directionality) can help improve performance.
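
To illustrate the point, here is a minimal sketch in plain Python. It is not the repository's implementation: `causal_scan` is a hypothetical toy recurrence standing in for a Mamba block, and summing the two directions is just one assumed way to combine them. It shows that a causal scan ignores later positions, while adding a scan over the flipped sequence lets every position see the whole input.

```python
def causal_scan(x, decay=0.5):
    """Toy causal recurrence (assumed stand-in for a Mamba block):
    h[t] = decay * h[t-1] + x[t], so out[t] depends only on x[0..t]."""
    h, out = 0.0, []
    for value in x:
        h = decay * h + value
        out.append(h)
    return out


def bidirectional_scan(x, decay=0.5):
    """Sum a forward scan with a scan over the flipped input, flipped back.
    Summation is an illustrative choice, not necessarily what M2 uses."""
    forward = causal_scan(x, decay)
    backward = causal_scan(x[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]


x = [1.0, 2.0, 3.0, 4.0]
x_changed = [1.0, 2.0, 3.0, 10.0]  # only the last element differs

# Causal: the first output is unaffected by a change at the last position.
causal_same = causal_scan(x)[0] == causal_scan(x_changed)[0]

# Bidirectional: the first output now reflects the last position.
bidir_differs = bidirectional_scan(x)[0] != bidirectional_scan(x_changed)[0]

print(causal_same, bidir_differs)
```

When the scanned axis is the channel axis, there is no natural "first" channel, which is why removing the one-directional dependency can be useful there.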

Honestly, the current version is at a very preliminary stage, and we have not performed extensive experiments on different architecture designs. We started our experiments with MLP and GLU (similar to HGRN) as channel-mixing methods and then tried Mamba. In the next version, we will present a relaxed version of channel mixing, which helps reduce the number of parameters without a drop in performance.

Thank you very much.