Open Hongjiew opened 2 months ago
Hi, thanks for your careful review. We will correct the order of the layers in the figure in the paper.
As for the number of layers, the current code is set up for subsequent ablation experiments. I think simply exchanging the definitions of the Mamba and attention layers can support higher Mamba layer ratios :)
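To make the suggested swap concrete, here is a minimal sketch — not the actual repository code; the function name and the `gap`-based condition are assumptions for illustration — of how exchanging the roles of the Mamba and attention layers would invert the ratio, so that Mamba blocks are stacked densely and an attention block is inserted only every `gap`-th position:

```python
# Hypothetical sketch (not the actual dimba.py code): make BiMamba the
# densely stacked layer and insert the attention triple (SA -> CA -> FFN)
# only every `gap`-th position, giving a Mamba : attention ratio of
# roughly gap : 1 instead of 1 : gap.

def swapped_layer_order(depth: int, gap: int) -> list[str]:
    order = []
    for i in range(depth):
        order.append("BiMamba")        # a Mamba block at every position
        if (i + 1) % gap == 0:         # attention only every `gap` blocks
            order += ["SA", "CA", "FFN"]
    return order

print(swapped_layer_order(4, 2).count("BiMamba"))
# -> 4  (vs. only 2 attention triples)
```

This is only meant to show the counting argument; the real change would swap the `self.blocks` / `self.mamba_blocks` definitions in `Dimba` rather than use a helper like this.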
Thank you for your excellent work! However, I have some questions regarding the Dimba architecture after reading the paper and the code.
TL;DR: the code seems to implement Dimba as Self-Attention -> Cross-Attention -> FFN -> BiMamba, which differs from Fig. 2 in the paper, where the architecture is Self-Attention -> Cross-Attention -> BiMamba -> FFN. In addition, the number of SA, CA, and FFN blocks always seems to be equal to or larger than the number of BiMamba blocks, whereas Fig. 2 in the paper suggests the possibility of stacking multiple BiMamba blocks with a single SA-CA block.
In the definition of `Dimba`, blocks are wrapped up as follows (dimba.py L388-394): `self.blocks` is defined as a list of `DimbaBlock` (dimba.py L306-315), where `DimbaBlock` wraps up Self-Attention, Cross-Attention, and FFN in the order SA -> CA -> FFN (dimba.py L84-87). `self.mamba_blocks` is defined by the function `create_block()` and is actually a list of `Block` (dimba.py L171-201), where each `Block` ends with a `Mamba` block. The code above seems to indicate that:

1. A block from `self.mamba_blocks` always follows a block in `self.blocks`. This suggests the architecture is SA -> CA -> FFN -> BiMamba, instead of SA -> CA -> BiMamba -> FFN as shown in Fig. 2 of the paper.
2. If `self.gap` > 1, some blocks in `self.blocks` will not be followed by a Mamba block, which means the number of SA, CA, and FFN blocks is larger than the number of Mamba blocks. Even if `self.gap` = 1, there will be only one Mamba block following each `DimbaBlock`. It seems Mamba blocks can never outnumber the SA, CA, and FFN blocks, which differs from the case indicated by Fig. 2 in the paper.

Thank you so much for your attention to this. It would be great if you could clarify the architecture of Dimba. Please let me know if my understanding of the code has errors.