Hi, @tian-qing001
Thanks for your excellent work.
I have several questions.
In the linear attention mechanism of MLLA, why is `v` not passed through a linear transformation?
Adding RoPE to the model might make it less friendly to inputs of various resolutions. Is RoPE really necessary? Would it be okay to just use LePE and CPE?
If I understand correctly, Mamba is in some ways a causal linear attention mechanism. But causality may not be needed for images, so directly applying Mamba in CV might not be very suitable?
Hi @Journey7331, thanks for your interest in our work.
As discussed in Sec. 5.2 and Fig. 3(c) of our paper, we omit the V projection before the linear attention calculation since a similar input projection already exists.
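To illustrate the idea, here is a minimal NumPy sketch of linear attention where `v` is simply the block input (assuming a preceding input projection has already mixed channels). This is a toy illustration, not MLLA's actual code; the `elu(x) + 1`-style kernel feature map is a common choice in linear attention and is assumed here.

```python
import numpy as np

def linear_attention_no_v_proj(x, Wq, Wk):
    """Toy linear attention where v = x (no separate V projection)."""
    q = x @ Wq
    k = x @ Wk
    # Positive kernel feature map, analogous to elu(t) + 1.
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))
    q, k = phi(q), phi(k)
    v = x  # the input itself acts as the values
    kv = k.T @ v                             # (d, d): global key-value summary
    z = q @ k.sum(axis=0, keepdims=True).T   # (n, 1): normalizer
    return (q @ kv) / (z + 1e-6)

rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.standard_normal((n, d))
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
out = linear_attention_no_v_proj(x, Wq, Wk)
print(out.shape)  # (8, 4)
```

Note that the whole computation is two small matmuls once `k.T @ v` is formed, which is what makes linear attention O(n) in sequence length.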
RoPE can work with various resolutions; a simple solution is implemented here. RoPE provides global positional information, which may be important for downstream tasks, so we recommend keeping it.
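One simple way to handle variable resolutions (a sketch of the general idea, not necessarily this repo's exact implementation) is to regenerate the rotary angles from the actual sequence length of each input, since RoPE's frequencies depend only on position index and channel:

```python
import numpy as np

def rope_angles(n_pos, dim, base=10000.0):
    """Standard RoPE angle table, regenerated per input length."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    pos = np.arange(n_pos)
    return np.outer(pos, inv_freq)  # (n_pos, dim // 2)

def apply_rope(x):
    """Rotate each feature pair of x (n, d), d even, by its position angle."""
    n, d = x.shape
    ang = rope_angles(n, d)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, the feature norms are preserved, and no learned parameters depend on the resolution, so the same weights handle any input size.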
Yes, causality may not be suitable for images. As analyzed in our paper, Mamba has to employ causal calculation, while our MLLA enjoys parallelizable computation. This is one of MLLA's key advantages over Mamba.
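The contrast can be sketched in a few lines (a toy illustration with hypothetical function names, omitting Mamba's gating/decay): the causal form must scan tokens with a running state, while the non-causal form collapses into one global matmul that parallelizes trivially.

```python
import numpy as np

def noncausal_linear_attn(q, k, v):
    """Bidirectional linear attention: one global K^T V matmul."""
    return q @ (k.T @ v)

def causal_linear_attn(q, k, v):
    """Causal variant: a running state, token i only sees tokens <= i."""
    state = np.zeros((q.shape[1], v.shape[1]))
    out = np.empty_like(v)
    for i in range(q.shape[0]):
        state += np.outer(k[i], v[i])  # accumulate key-value prefix
        out[i] = q[i] @ state
    return out

rng = np.random.default_rng(2)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
full = noncausal_linear_attn(q, k, v)
causal = causal_linear_attn(q, k, v)
# Only the last token, which sees the full prefix, matches the global result.
print(np.allclose(causal[-1], full[-1]))  # True
```

For images, every token can attend to the whole feature map at once, so the non-causal form both fits the data better and avoids the sequential scan.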
If something is wrong, please correct me :)