Open chenzean opened 1 month ago
We assume that even if the hidden-state dimension is 1, B and C can still capture the foreground information. Basically, d_state is set to 1 to gain more throughput.
Yes, it then behaves more like a transformer, except for the gate mechanism ($e^{A\Delta}$). You can refer to page 6 of the arXiv paper https://arxiv.org/pdf/2401.10166v2 for more details. (Just as I said, SSM is transformer, or transformer is SSM.)
You can refer to page 5 of https://arxiv.org/pdf/2401.10166v2 for detailed information on the discretization.
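As a rough illustration of the discretization discussed above, here is a minimal sketch (my own toy example, not code from the repo; it assumes d_state = 1 and the simplified $\bar{B} \approx \Delta B$ used in Mamba-style models, and all names are hypothetical):

```python
import math

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization of a 1-D (d_state = 1) SSM.

    Continuous:   h'(t) = A*h(t) + B*x(t)
    Discretized:  h_k = Abar*h_{k-1} + Bbar*x_k
    with Abar = exp(delta*A); Bbar is approximated as delta*B,
    the simplified form used in Mamba-style models.
    """
    Abar = math.exp(delta * A)  # this is the e^{A*delta} "gate" term
    Bbar = delta * B
    return Abar, Bbar

# With A < 0, Abar lies in (0, 1) and acts as a forget gate:
# a larger delta makes the state decay (forget) faster.
Abar, Bbar = discretize_zoh(A=-1.0, B=1.0, delta=0.5)
```

This also shows why $e^{A\Delta}$ is described as a gate in the reply above: for negative A it is a number between 0 and 1 that scales down the previous state.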
Thank you for the reply. Regarding questions 2 and 3, I will go back and read the paper carefully. But I still have one more question: have you done any ablation experiments on d_state? My intuition is that d_state controls the memory range. (My understanding of d_state may be wrong; I hope you can point out my mistake.) I still look forward to your reply. Thank you very much.
We find that it is the gate mechanism (w in the arXiv paper) that controls the memory range.
We did ablations on the hyperparameters d_state, ssm_ratio, mlp_ratio, the number of layers, initialization, and so on; you can find the corresponding code and config files in the repo.
Hello. Yesterday I carefully re-read the discretization formulas you wrote (perhaps I still have not fully understood them). Starting from what you said in an online presentation, I went through the code against the formulas, but I still cannot match this part up. Could you give me some guidance? I really look forward to your reply; I would very much like to apply VMamba in my own research field.
Actually, the discretization is in the CUDA code, while the Python code is just for parameter preparation.
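For matching the formulas to the code, a pure-Python reference of the recurrence may help (a toy sketch with d_state = 1, written by me for illustration; it is not the actual CUDA kernel, and all names are hypothetical):

```python
import math

def selective_scan_ref(xs, dts, A, Bs, Cs):
    """Toy reference of the selective-scan recurrence (d_state = 1).

    Per step:  h = exp(dt*A) * h + dt*B*x   (discretized state update)
               y = C * h                    (input-dependent readout)
    The real implementation runs this loop inside a CUDA kernel; the
    Python side only prepares the per-step parameters dts, Bs, Cs.
    """
    h = 0.0
    ys = []
    for x, dt, B, C in zip(xs, dts, Bs, Cs):
        h = math.exp(dt * A) * h + dt * B * x
        ys.append(C * h)
    return ys

# An impulse input decays by exp(dt*A) at each later step when A < 0.
ys = selective_scan_ref([1.0, 0.0], [1.0, 1.0], -1.0, [1.0, 1.0], [1.0, 1.0])
```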
Oh, I see. Thank you for the reply. I still have a few more questions:
x_dbl does not actually exist as a separate quantity; it is only the combination of B, C, and dts, kept together for simplicity.
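To illustrate the point above with a toy example (the sizes and names here are made up by me; only the slicing pattern matters): the projection produces one flat vector per token, and "x_dbl" is just that concatenated output before it is sliced apart.

```python
# Hypothetical toy sizes: each token is projected to dt_rank + 2*d_state
# values; "x_dbl" is simply that concatenated output before slicing.
dt_rank, d_state = 2, 1

def split_x_dbl(x_dbl):
    """Slice the combined projection output into (dts, B, C)."""
    dts = x_dbl[:dt_rank]
    B = x_dbl[dt_rank:dt_rank + d_state]
    C = x_dbl[dt_rank + d_state:dt_rank + 2 * d_state]
    return dts, B, C

dts, B, C = split_x_dbl([0.1, 0.2, 0.3, 0.4])
```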
Yes, CrossScan and CrossMerge are only for arranging the sequence. But if we did not implement the backward function, the backward pass would be slow, since autograd would have to trace back exactly along the path taken in the forward pass.
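A toy sketch of what "arranging the sequence" means (pure Python, my own illustration rather than the repo's implementation): CrossScan unfolds an H×W grid into four 1-D scan orders, and CrossMerge maps the four sequences back onto the grid and sums them.

```python
def cross_scan(grid):
    """Unfold an H x W grid into four scan orders: row-major,
    column-major, and their reversals (a toy version of CrossScan)."""
    H, W = len(grid), len(grid[0])
    rowwise = [grid[i][j] for i in range(H) for j in range(W)]
    colwise = [grid[i][j] for j in range(W) for i in range(H)]
    return [rowwise, colwise, rowwise[::-1], colwise[::-1]]

def cross_merge(seqs, H, W):
    """Undo the four orderings and sum the sequences back onto the grid
    (a toy version of CrossMerge)."""
    out = [[0.0] * W for _ in range(H)]
    orders = [
        [(i, j) for i in range(H) for j in range(W)],   # row-major
        [(i, j) for j in range(W) for i in range(H)],   # column-major
    ]
    orders += [orders[0][::-1], orders[1][::-1]]        # reversed orders
    for seq, order in zip(seqs, orders):
        for v, (i, j) in zip(seq, order):
            out[i][j] += v
    return out

merged = cross_merge(cross_scan([[1, 2], [3, 4]]), 2, 2)
```

Since these are pure index rearrangements, a hand-written backward can just apply the inverse index maps to the incoming gradients in one pass, which is much cheaper than letting autograd replay every indexing operation from the forward trace.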
Thank you for the patient reply; I understand now. I will go read/modify the scan code (the serpentine Scan and serpentine Merge). (If I have further questions, I will come back and ask.) Really, many thanks for your reply.
When you say "if we do not implement the backward function, the backward would be slow", does that mean it would increase training or inference time?
Hello, I have a few questions:
If these questions are too presumptuous, I apologize, but I still look forward to your reply. Thank you.