HiPPO initialization of matrix A in the SS2D module

LQchen1 commented 3 weeks ago

Thank you for your contribution. I have read the source code and noticed that, in the SS2D module, the A matrix is initialized to 0. However, according to the Mamba paper, HiPPO initialization of the A matrix is crucial to the model's performance. Why was this method not adopted here?

MzeroMiko commented 3 weeks ago

Actually, the A matrix uses HiPPO initialization in most cases in this repo. Only in some ablation studies is the initialization changed to random or zero.

In our observation, when dstate (the SSM state dimension) is small and cross scan and cross merge are used, the initialization of A is not that important.
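
For readers comparing the three options mentioned above, here is a minimal sketch in plain Python, not the repository's actual code. It assumes the convention of the Mamba reference implementation, where the model stores a log-parameter and uses A = -exp(A_log), and where the HiPPO-style (S4D-real) initialization sets the n-th state of every channel to -(n + 1). The function names `init_a_log` and `to_a` and the `mode` strings are hypothetical. It also assumes that "zero" means zeros in the log-parameter, which gives A = -1 everywhere rather than A = 0.

```python
import math
import random


def init_a_log(d_inner, d_state, mode="hippo", seed=0):
    """Return a d_inner x d_state nested list holding the log-parameter A_log.

    Hypothetical helper for illustration; mode names are assumptions.
    """
    rng = random.Random(seed)
    if mode == "hippo":
        # S4D-real style: A[d][n] = n + 1 for every channel d, stored as log.
        return [[math.log(n + 1) for n in range(d_state)] for _ in range(d_inner)]
    if mode == "random":
        # Standard normal values in log space (assumed ablation variant).
        return [[rng.gauss(0.0, 1.0) for _ in range(d_state)] for _ in range(d_inner)]
    if mode == "zero":
        # Zeros in log space (assumed ablation variant), so A becomes -1.
        return [[0.0] * d_state for _ in range(d_inner)]
    raise ValueError(f"unknown mode: {mode}")


def to_a(a_log):
    """Map the log-parameter to the (always negative) diagonal of A."""
    return [[-math.exp(v) for v in row] for row in a_log]


if __name__ == "__main__":
    for mode in ("hippo", "random", "zero"):
        a = to_a(init_a_log(d_inner=2, d_state=4, mode=mode))
        print(mode, [round(v, 3) for v in a[0]])
```

Under these assumptions all three modes produce a strictly negative A, so the state always decays. They differ only in how the decay rates are spread across the dstate dimensions, which may be part of why the choice matters less when dstate is small.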

LQchen1 commented 3 weeks ago

Can the phenomenon you mentioned be understood like this? In language or time-series tasks, the next output usually depends most on nearby elements of the sequence, so information from nearby elements should be retained while information from more distant elements should be compressed more strongly. Visual tasks have no such relationship, so there is no need to initialize A with the HiPPO matrix.
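
To make the "retain nearby, compress distant" intuition concrete, here is a small sketch under simplifying assumptions: a diagonal A, zero-order-hold discretization, and a constant step size `delta` (in Mamba the step size is input-dependent). With h_t = exp(delta * A) * h_{t-1} + B_bar * x_t, an input seen k steps ago is scaled by exp(delta * A * k) in the current state. The function name `memory_weights` and the example values are hypothetical.

```python
import math


def memory_weights(a_diag, delta, steps):
    """Return weights[n][k] = exp(delta * a_diag[n] * k).

    weights[n][k] is the factor by which state dimension n scales an input
    that arrived k steps in the past (constant delta assumed).
    """
    return [[math.exp(delta * a * k) for k in range(steps)] for a in a_diag]


if __name__ == "__main__":
    hippo_like = [-1.0, -2.0, -3.0, -4.0]  # S4D-real style diagonal: -(n + 1)
    constant = [-1.0, -1.0, -1.0, -1.0]    # every dimension decays equally
    for name, a_diag in (("hippo-like", hippo_like), ("constant", constant)):
        w = memory_weights(a_diag, delta=0.1, steps=20)
        # Weight each dimension still gives to an input from 19 steps ago.
        print(name, [round(row[-1], 3) for row in w])
```

With the HiPPO-like diagonal, the dimensions forget at different rates, so the state mixes short and long memory. With a constant diagonal, every dimension forgets at the same rate. This only illustrates the hypothesis above and does not confirm it.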

MzeroMiko commented 3 weeks ago

It is a good hypothesis. 👍

MzeroMiko / VMamba, issue #217