Hi, @tian-qing001
Thanks for your excellent work.
I have several questions.
In the linear attention mechanism of MLLA, why is `v` not passed through a linear transformation?
Adding RoPE to the model might make it less friendly to inputs of various resolutions. Is RoPE really necessary? Would it be okay to just use LePE and CPE?
If I understand correctly, Mamba is in some ways a causal linear attention mechanism. But causality may not be needed for images, so directly applying Mamba in CV might not be very suitable?
Hi @Journey7331, thanks for your interest in our work.
As discussed in Sec. 5.2 and Fig. 3(c) of our paper, we omit the V projection before the linear attention calculation since a similar input projection already exists.
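To illustrate the idea, here is a minimal NumPy sketch of linear attention where `v` is simply the block input (assuming a preceding input projection has already mixed channels). This is a toy illustration, not MLLA's actual code; the `elu(x) + 1`-style kernel feature map is a common choice in linear attention and is assumed here.

```python
import numpy as np

def linear_attention_no_v_proj(x, Wq, Wk):
    """Toy linear attention where v = x (no separate V projection)."""
    q = x @ Wq
    k = x @ Wk
    # Positive kernel feature map, analogous to elu(t) + 1.
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))
    q, k = phi(q), phi(k)
    v = x  # the input itself acts as the values
    kv = k.T @ v                             # (d, d): global key-value summary
    z = q @ k.sum(axis=0, keepdims=True).T   # (n, 1): normalizer
    return (q @ kv) / (z + 1e-6)

rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.standard_normal((n, d))
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
out = linear_attention_no_v_proj(x, Wq, Wk)
print(out.shape)  # (8, 4)
```

Note that the whole computation is two small matmuls once `k.T @ v` is formed, which is what makes linear attention O(n) in sequence length.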
RoPE can work with various resolutions; a simple solution is implemented here. RoPE provides global positional information, which may be important for downstream tasks, so we recommend keeping it.
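One simple way to handle variable resolutions (a sketch of the general idea, not necessarily this repo's exact implementation) is to regenerate the rotary angles from the actual sequence length of each input, since RoPE's frequencies depend only on position index and channel:

```python
import numpy as np

def rope_angles(n_pos, dim, base=10000.0):
    """Standard RoPE angle table, regenerated per input length."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    pos = np.arange(n_pos)
    return np.outer(pos, inv_freq)  # (n_pos, dim // 2)

def apply_rope(x):
    """Rotate each feature pair of x (n, d), d even, by its position angle."""
    n, d = x.shape
    ang = rope_angles(n, d)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, the feature norms are preserved, and no learned parameters depend on the resolution, so the same weights handle any input size.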
Yes, causality may not be suitable for images. As analyzed in our paper, Mamba has to employ causal calculation, while our MLLA enjoys parallelizable computation. This is one of MLLA's key advantages over Mamba.
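The contrast can be sketched in a few lines (a toy illustration with hypothetical function names, omitting Mamba's gating/decay): the causal form must scan tokens with a running state, while the non-causal form collapses into one global matmul that parallelizes trivially.

```python
import numpy as np

def noncausal_linear_attn(q, k, v):
    """Bidirectional linear attention: one global K^T V matmul."""
    return q @ (k.T @ v)

def causal_linear_attn(q, k, v):
    """Causal variant: a running state, token i only sees tokens <= i."""
    state = np.zeros((q.shape[1], v.shape[1]))
    out = np.empty_like(v)
    for i in range(q.shape[0]):
        state += np.outer(k[i], v[i])  # accumulate key-value prefix
        out[i] = q[i] @ state
    return out

rng = np.random.default_rng(2)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
full = noncausal_linear_attn(q, k, v)
causal = causal_linear_attn(q, k, v)
# Only the last token, which sees the full prefix, matches the global result.
print(np.allclose(causal[-1], full[-1]))  # True
```

For images, every token can attend to the whole feature map at once, so the non-causal form both fits the data better and avoids the sequential scan.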
If something is wrong, please correct me :)