MzeroMiko / VMamba

VMamba: Visual State Space Models; code is based on Mamba

some problems about the 'chunksize' #137

Open sewkyz1 opened 7 months ago

sewkyz1 commented 7 months ago

Hello, thank you for open-sourcing this project!

I have some questions about the "chunksize" parameter in the selective_scan module.

1. When examining test_selective_scan_easy.py under VMamba-main/kernels/selective_scan/csrc/selective_scan, I noticed the "chunksize" parameter. Why does setting a large chunksize lead to Ats approaching 0? Does this have any impact on actual computations?

2. In selective_scan_fwd_kernel.cuh, I noticed that chunksize is set to 2048. Does this also cause the issue mentioned in question 1? The code is: `const int n_chunks = (seqlen + 2048 - 1) / 2048;` (a small sketch of this ceiling division follows this list).

3. Will dividing the seq_len tokens into chunks of chunksize affect the value of h (the state) and impair the module's ability to extract global features?

4. Is selective_scan_easy in test_selective_scan_easy.py equivalent to the CUDA version of selective_scan (disregarding runtime efficiency)? Additionally, I noticed that both test_selective_scan_easy.py and test_selective_scan.py are included in your provided mamba-mini project. What is the difference between these two files?
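(For reference, the CUDA line quoted in question 2 is a standard ceiling division; here is a tiny sketch, where `n_chunks` is a hypothetical helper name, not a function from the repo:)

```python
# Sketch of the chunk count computed in selective_scan_fwd_kernel.cuh.
def n_chunks(seqlen: int, chunksize: int = 2048) -> int:
    return (seqlen + chunksize - 1) // chunksize  # ceiling division

assert n_chunks(2048) == 1
assert n_chunks(2049) == 2  # a partial trailing chunk still counts
```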

Thank you once again for your great work. I am looking forward to your response.

MzeroMiko commented 7 months ago

GOOD questions!

  1. A bigger chunksize means a bigger $\sum_i \Delta_i$, which leads to a smaller $\exp(A \sum_i \Delta_i)$. This does have an impact on the actual computation, but it has no impact on the forward procedure in the CUDA code (see the sketch after this list).

  2. Not for the forward procedure, but for the backward there really is some influence; the consequence is that if you use S6 in float16, the training process becomes unstable (the loss goes NaN).

  3. Since different chunksizes do produce numerical differences, this does affect performance if you are about to train a model yourself. But for inference, this slight change does no harm to performance.

  4. No, they are different, but quite similar. I implemented the easy code just to 1. understand the CUDA code and 2. try to find a way to implement SSMs more generally, not only on CUDA.
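To make answer 1 concrete, here is a minimal sketch with assumed toy values (not taken from the repo) of how a larger chunk accumulates a larger $\sum_i \Delta_i$ and drives $\exp(A \sum_i \Delta_i)$ toward 0:

```python
import torch

A = torch.tensor(-1.0)             # A is negative in S6-style parameterizations
delta = torch.full((2048,), 0.01)  # toy per-step Delta values

for chunksize in (64, 256, 2048):
    at = torch.exp(A * delta[:chunksize].sum())
    print(chunksize, at.item())
# 64   -> exp(-0.64)  ~ 0.53
# 256  -> exp(-2.56)  ~ 0.077
# 2048 -> exp(-20.48) ~ 1.3e-9  (would underflow to 0 in float16)
```

Values this small are representable in a float32 forward pass, but, as answer 2 notes, they can destabilize a float16 backward pass.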

sewkyz1 commented 7 months ago

Thank you for your detailed response.

However, I still have some questions regarding the "hprefix" parameter in test_selective_scan_easy.py.

`hs = hs_tmp + Ats * hprefix.unsqueeze(0)`

In the first chunk, hprefix in the above formula is initialized to 0. This changes the calculation of "h" in the first chunk to "h = delta * b * u", without "a".

Is there something wrong with my understanding? If my understanding is correct, will it cause any problems?

sewkyz1 commented 7 months ago

Additionally, is there any difference between the CUDA C++ implementation and the Python version regarding this detail?

MzeroMiko commented 6 months ago

> Thank you for your detailed response.
>
> However, I still have some questions regarding the "hprefix" parameter in test_selective_scan_easy.py.
>
> `hs = hs_tmp + Ats * hprefix.unsqueeze(0)`
>
> In the first chunk, hprefix in the above formula is initialized to 0. This changes the calculation of "h" in the first chunk to "h = delta * b * u", without "a".
>
> Is there something wrong with my understanding? If my understanding is correct, will it cause any problems?

Yes, you are partially right, but it happens not in the whole first chunk, only in the first iteration, since A still functions inside hs_tmp for the later steps. In the CUDA implementation, the state is also 0 in the first iteration.
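For reference, here is a minimal single-channel sketch of the chunk-wise recurrence under discussion; the names mirror the question, but the real kernels carry batch/dim/state axes (hence the unsqueeze in the original line), so this is illustrative, not the repo's code:

```python
import torch

def chunked_scan(deltaA, deltaBu, chunksize):
    # deltaA[t]  ~ exp(Delta_t * A), deltaBu[t] ~ Delta_t * B_t * u_t
    # Recurrence: h_t = deltaA[t] * h_{t-1} + deltaBu[t]
    L = deltaA.shape[0]
    hprefix = torch.zeros(())              # carried state; 0 before chunk 0
    hs = []
    for s in range(0, L, chunksize):
        dA = deltaA[s:s + chunksize]
        dBu = deltaBu[s:s + chunksize]
        Ats = torch.cumprod(dA, dim=0)     # running product of exp(Delta*A)
        hs_tmp = Ats * torch.cumsum(dBu / Ats, dim=0)
        hs_chunk = hs_tmp + Ats * hprefix  # first chunk: hprefix == 0, so the
                                           # very first step is just deltaBu[0]
        hs.append(hs_chunk)
        hprefix = hs_chunk[-1]             # last state feeds the next chunk
    return torch.cat(hs)
```

The division by Ats is exactly where a large chunksize bites: once Ats shrinks toward 0, dBu / Ats blows up, which matches the float16 instability described in the answers above.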

MzeroMiko commented 6 months ago

> Additionally, is there any difference between the CUDA C++ implementation and the Python version regarding this detail?

There is no difference between the CUDA implementation and the torch implementation in theory, but there are lots of numerical differences in practice.

For example, even in torch, $c \cdot (a/b)$ can differ numerically from $a/(b/c)$, although the two are equal algebraically; that is why computing $\partial(a/b)/\partial b$ as $-(a/b)/b$ can give a noticeably different result from $-a/(b \cdot b)$.
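A tiny illustration of this floating-point effect, with assumed toy values:

```python
import torch

a = torch.tensor(1e-3)
b = torch.tensor(3e4)
c = torch.tensor(7e3)

# Algebraically c*(a/b) == a/(b/c) == a*c/b, but the rounding of each
# intermediate division can leave the two results apart by a few ulps.
print(c * (a / b), a / (b / c))

# Likewise for the gradient of a/b with respect to b: both expressions
# equal -a/b^2 analytically, yet they may round differently.
print(-(a / b) / b, -a / (b * b))
```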