Open chenzean opened 1 month ago
We assume that even if the hidden-state dimension is 1, B and C can still capture the foreground information. Basically, d_state is set to 1 to gain more throughput.
Yes, it then behaves more like a transformer, except for the gate mechanism ($e^{A\Delta}$). You can refer to page 6 of the arXiv paper https://arxiv.org/pdf/2401.10166v2 for more details. (Just as I said, SSM is transformer, or transformer is SSM.)
You can refer to page 5 of https://arxiv.org/pdf/2401.10166v2 for detailed information on the discretization.
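As a rough illustration of the discretization discussed above, here is a minimal sketch (my own toy example, not code from the repo; it assumes d_state = 1 and the simplified $\bar{B} \approx \Delta B$ used in Mamba-style models, and all names are hypothetical):

```python
import math

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization of a 1-D (d_state = 1) SSM.

    Continuous:   h'(t) = A*h(t) + B*x(t)
    Discretized:  h_k = Abar*h_{k-1} + Bbar*x_k
    with Abar = exp(delta*A); Bbar is approximated as delta*B,
    the simplified form used in Mamba-style models.
    """
    Abar = math.exp(delta * A)  # this is the e^{A*delta} "gate" term
    Bbar = delta * B
    return Abar, Bbar

# With A < 0, Abar lies in (0, 1) and acts as a forget gate:
# a larger delta makes the state decay (forget) faster.
Abar, Bbar = discretize_zoh(A=-1.0, B=1.0, delta=0.5)
```

This also shows why $e^{A\Delta}$ is described as a gate in the reply above: for negative A it is a number between 0 and 1 that scales down the previous state.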
Thank you for the reply. Regarding questions 2 and 3, I will go back and read the paper carefully. But I still have one more question: have you done any ablation experiments on d_state? My intuition is that d_state controls the memory range. (My understanding of d_state may be wrong; I hope you can point out my mistake.) I still look forward to your reply. Thank you very much.
We find that it is the gate mechanism (w in the arXiv paper) that controls the memory range.
We did ablations on the hyperparameters d_state, ssm_ratio, mlp_ratio, the number of layers, initialization, and so on; you can find the corresponding code and config files in the repo.
Hello. Yesterday I carefully re-read the discretization formulas you wrote (perhaps I still have not fully understood them). Starting from what you said in an online presentation, I went through the code against the formulas, but I still cannot match this part up. Could you give me some guidance? I really look forward to your reply; I would very much like to apply VMamba in my own research field.
Actually, the discretization is in the CUDA code, while the Python code is just for parameter preparation.
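For matching the formulas to the code, a pure-Python reference of the recurrence may help (a toy sketch with d_state = 1, written by me for illustration; it is not the actual CUDA kernel, and all names are hypothetical):

```python
import math

def selective_scan_ref(xs, dts, A, Bs, Cs):
    """Toy reference of the selective-scan recurrence (d_state = 1).

    Per step:  h = exp(dt*A) * h + dt*B*x   (discretized state update)
               y = C * h                    (input-dependent readout)
    The real implementation runs this loop inside a CUDA kernel; the
    Python side only prepares the per-step parameters dts, Bs, Cs.
    """
    h = 0.0
    ys = []
    for x, dt, B, C in zip(xs, dts, Bs, Cs):
        h = math.exp(dt * A) * h + dt * B * x
        ys.append(C * h)
    return ys

# An impulse input decays by exp(dt*A) at each later step when A < 0.
ys = selective_scan_ref([1.0, 0.0], [1.0, 1.0], -1.0, [1.0, 1.0], [1.0, 1.0])
```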
Oh, I see. Thank you for the reply. I still have a few more questions:
x_dbl does not actually exist as a separate quantity; it is only the combination of B, C, and dts, kept together for simplicity.
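To illustrate the point above with a toy example (the sizes and names here are made up by me; only the slicing pattern matters): the projection produces one flat vector per token, and "x_dbl" is just that concatenated output before it is sliced apart.

```python
# Hypothetical toy sizes: each token is projected to dt_rank + 2*d_state
# values; "x_dbl" is simply that concatenated output before slicing.
dt_rank, d_state = 2, 1

def split_x_dbl(x_dbl):
    """Slice the combined projection output into (dts, B, C)."""
    dts = x_dbl[:dt_rank]
    B = x_dbl[dt_rank:dt_rank + d_state]
    C = x_dbl[dt_rank + d_state:dt_rank + 2 * d_state]
    return dts, B, C

dts, B, C = split_x_dbl([0.1, 0.2, 0.3, 0.4])
```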
Yes, CrossScan and CrossMerge are only for arranging the sequence. But if we did not implement the backward function, the backward pass would be slow, since autograd would have to trace back exactly along the path taken in the forward pass.
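A toy sketch of what "arranging the sequence" means (pure Python, my own illustration rather than the repo's implementation): CrossScan unfolds an H×W grid into four 1-D scan orders, and CrossMerge maps the four sequences back onto the grid and sums them.

```python
def cross_scan(grid):
    """Unfold an H x W grid into four scan orders: row-major,
    column-major, and their reversals (a toy version of CrossScan)."""
    H, W = len(grid), len(grid[0])
    rowwise = [grid[i][j] for i in range(H) for j in range(W)]
    colwise = [grid[i][j] for j in range(W) for i in range(H)]
    return [rowwise, colwise, rowwise[::-1], colwise[::-1]]

def cross_merge(seqs, H, W):
    """Undo the four orderings and sum the sequences back onto the grid
    (a toy version of CrossMerge)."""
    out = [[0.0] * W for _ in range(H)]
    orders = [
        [(i, j) for i in range(H) for j in range(W)],   # row-major
        [(i, j) for j in range(W) for i in range(H)],   # column-major
    ]
    orders += [orders[0][::-1], orders[1][::-1]]        # reversed orders
    for seq, order in zip(seqs, orders):
        for v, (i, j) in zip(seq, order):
            out[i][j] += v
    return out

merged = cross_merge(cross_scan([[1, 2], [3, 4]]), 2, 2)
```

Since these are pure index rearrangements, a hand-written backward can just apply the inverse index maps to the incoming gradients in one pass, which is much cheaper than letting autograd replay every indexing operation from the forward trace.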
Thank you for the patient reply; I understand now. I will go read/modify the scan code (the serpentine Scan and serpentine Merge). (If I have further questions, I will come back and ask.) Really, many thanks for your reply.
When you say "if we do not implement the backward function, the backward would be slow", does that mean it would increase training or inference time?
Hello, I have a few questions:
If these questions are too presumptuous, I apologize, but I still look forward to your reply. Thank you.