Open TalRub104 opened 4 days ago
Thank you for your feedback. For the module connections, we followed the S4 CIFAR-10 training example (https://github.com/state-spaces/s4/blob/main/example.py). Residual connections are currently among the most common ways to stack modules, so to keep the structure consistent across all models, we used them uniformly. The one exception is Mega: its basic block already includes a gated residual connection, so we did not add another one.
Hi, in the S6_SSM forward function, you do the following:
    for layer in self.mapping_layers:
        residual = x
        x = layer(x, state)
        x = residual + x
        x = self.norms[num_layer](x)
        num_layer += 1
In your paper, you did not specify that you add a residual to the Mamba layer output. Could you clarify why you chose to do that?
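For readers following the thread, the pattern in question (add the layer input back to its output, then normalize) can be sketched without any dependencies. This is only an illustrative sketch, not the repository's code: `toy_layer`, `layer_norm`, and `residual_stack` are hypothetical stand-ins, the `state` argument is omitted, and the normalization has no learned parameters.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_layer(scale):
    """Hypothetical stand-in for a sequence-mixing block such as a Mamba layer."""
    return lambda x: [math.tanh(scale * v) for v in x]


def residual_stack(x, layers, use_residual=True):
    """Apply each layer, optionally add the skip connection, then normalize."""
    for layer in layers:
        residual = x
        x = layer(x)
        if use_residual:
            # Skip connection: the layer only has to model a correction to its input.
            x = [r + v for r, v in zip(residual, x)]
        x = layer_norm(x)
    return x


layers = [toy_layer(0.5), toy_layer(0.25)]
x = [1.0, 2.0, 4.0]
with_skip = residual_stack(x, layers, use_residual=True)
without_skip = residual_stack(x, layers, use_residual=False)
```

Toggling `use_residual` shows that the two variants produce different outputs for the same layers, which is why it matters whether the paper's description includes the skip connection.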