MzeroMiko / VMamba

VMamba: Visual State Space Models; code is based on Mamba
MIT License

About Seed #257

Open zs1314 opened 4 months ago

zs1314 commented 4 months ago

Hello, sorry to bother you! Your work is excellent! However, in my experiments I found that even after fixing the random seed, the training results are still inconsistent across runs. In this case, how can I tell whether my improvement is actually effective? I have already ruled out environmental issues: I tested other models, and they reproduce exactly once the seed is fixed.

MzeroMiko commented 4 months ago

What do you mean by "inconsistent training results"? Does the performance differ from what we reported, or does it differ across multiple experiments you ran with the same config and seed?

zs1314 commented 4 months ago

@MzeroMiko It differs across multiple experiments we run with the same config and seed.

MzeroMiko commented 4 months ago

Here's a trick: you can run N iterations of training and then print torch.randn((1,)) to see whether the random seed is consumed in the same way across experiments. The number it prints should be the same every time.
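
A minimal sketch of that check (train_one_iter here is a placeholder for a single training step, not a function from the VMamba codebase):

import torch

def check_rng_consumption(train_one_iter, n_iters=10, seed=0):
    # Seed everything, run a few iterations, then sample from the global RNGs.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    for _ in range(n_iters):
        train_one_iter()
    # If the RNG state has been consumed identically in two runs,
    # both runs print the same numbers here.
    print(torch.randn((1,)))
    print(torch.randn((1,), device="cuda"))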

zs1314 commented 4 months ago

@MzeroMiko Thank you for your answer! I've actually tried that! I wonder if there is some operation in VSSM that cannot be made deterministic just by fixing the seed.

MzeroMiko commented 4 months ago

At runtime there should be no operation that can "eat" a random number; even the dataloader cannot do that, except during initialization or re-initialization. So if the random number you print is not the same, something must be wrong.

zs1314 commented 4 months ago

@MzeroMiko I've tried printing the random number and it is the same every time, so my seed fix is working. But the forward and backward passes of Mamba seem to differ between runs! Did you run your experiments with a fixed seed, and if so, how did you do it? Or did you average over multiple experiments?

zs1314 commented 4 months ago

@MzeroMiko Perhaps you could run one epoch of training after fixing the seed and check whether the train loss, val loss, and val acc are identical across two trainings with the same seed.
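
A minimal sketch of that comparison (the model and data below are stand-ins, not the actual VMamba training pipeline):

import torch
import torch.nn as nn

def short_run_loss(seed, make_model, steps=50):
    # Fully seed the run, train for a few steps on synthetic data, return the final loss.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    model = make_model().cuda()
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss = None
    for _ in range(steps):
        x = torch.randn(8, 96, device="cuda")
        y = torch.randn(8, 10, device="cuda")
        loss = ((model(x) - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# If training is deterministic, two runs with the same seed give bit-identical losses.
make_model = lambda: nn.Sequential(nn.Linear(96, 96), nn.GELU(), nn.Linear(96, 10))
print(short_run_loss(0, make_model), short_run_loss(0, make_model))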

MzeroMiko commented 4 months ago

Actually, there is a file called vmamba_check.py which contains code for checking the differences between different versions of the code. It can easily be modified to check whether running VSSM gives the same result across runs.

MzeroMiko commented 4 months ago

> @MzeroMiko Perhaps you could run one epoch of training after fixing the seed and check whether the train loss, val loss, and val acc are identical across two trainings with the same seed.

So have you done this experiment? What is the conclusion?

zs1314 commented 4 months ago

> Perhaps you could run one epoch of training after fixing the seed and check whether the train loss, val loss, and val acc are identical across two trainings with the same seed.
>
> So have you done this experiment? What is the conclusion?

@MzeroMiko Unfortunately, I tried, and the results were inconsistent between the two runs. I have also tested ConvNeXt and ViT models, and they are consistent. My seeding function:

import os
import random
import numpy as np
import torch

def seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.enabled = True
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':16:8'
    # torch.use_deterministic_algorithms(True)
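
For reference, a stricter variant of this setup is sketched below. Note that torch.use_deterministic_algorithms only governs PyTorch's built-in operators, so a custom CUDA extension such as the selective-scan kernel is unlikely to be affected by it:

import os
import torch

def seed_strict(seed=0):
    # Sketch of a stricter determinism setup.
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'  # required by some cuBLAS ops
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # warn_only=True reports built-in non-deterministic ops instead of raising;
    # custom extensions do not check this flag, so they are not covered.
    torch.use_deterministic_algorithms(True, warn_only=True)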

zs1314 commented 4 months ago

@MzeroMiko A new finding: I've seen the same behaviour with other vision Mamba variants, such as PlainMamba.

ChenJunhao-Fighting commented 4 months ago

@MzeroMiko @zs1314 Hi, I also encountered this problem, have you solved it?

zs1314 commented 4 months ago

No! I just average the results over multiple experiments.


ChenJunhao-Fighting commented 4 months ago

> No! I just average the results over multiple experiments.

Thank you for your reply!

MzeroMiko commented 4 months ago

Thank you all for your findings. I will try to find out the reason in the coming days, once a machine is available.

MzeroMiko commented 4 months ago

I did a small experiment which shows that the difference may come from the gradient calculation for B, C, D, and delta_bias. For more details, you can refer to selective_scan_bwd_kernel_oflex.cuh or selective_scan_bwd_kernel.cuh. I suspect the reason is the non-deterministic order in which partial sums from different thread blocks are accumulated via gpuAtomicAdd: floating-point addition is not associative, so (a + b) + c does not always equal a + (b + c).

However, although gpuAtomicAdd is also used when calculating the gradient of A, it is interesting that dA is not affected and stays consistent across runs.
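
A quick numeric illustration of this order dependence (a standalone snippet, not part of the experiment below): with float32, the grouping of additions changes the result, so any reduction whose accumulation order varies between runs can produce slightly different sums.

import torch

a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(-1e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)

print((a + b) + c)  # tensor(1.)
print(a + (b + c))  # tensor(0.) -- the 1.0 is lost when added to -1e8 first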

import torch
from models.csms6s import selective_scan_cuda_oflex, selective_scan_cuda

def s6fb(x: torch.Tensor, delta_softplus=True, oflex=True, backend=None):
    # x: B, k*C, L
    x = x.flatten(2, 3)
    N = 1
    K = 1
    B, KC, L = x.shape
    C = KC // K
    u = x
    delta = x.sigmoid().view(B, KC, L)
    A = -x.sigmoid().sum(0)[:, :N].view(KC, N)
    B = x.sigmoid().view(B, K, C, L)[:, :, :N, :]
    C = B + 1
    D = x.tanh().sum(0).sum(-1).view(KC)
    delta_bias = delta.sum(0).sum(-1).view(KC)

    # out, x, *rest = selective_scan_cuda_oflex.fwd(u, delta, A, B, C, D, delta_bias, delta_softplus, 1, oflex)
    out, x, *rest = selective_scan_cuda.fwd(u, delta, A, B, C, D, None, delta_bias, delta_softplus)

    dout = out.sigmoid()
    # du, ddelta, dA, dB, dC, dD, ddelta_bias, *rest = selective_scan_cuda_oflex.bwd(
    #     u, delta, A, B, C, D, delta_bias, dout, x, delta_softplus, 1
    # )

    du, ddelta, dA, dB, dC, dD, ddelta_bias, *rest = selective_scan_cuda.bwd(
        u, delta, A, B, C, D, None, delta_bias, dout, x, None, None, delta_softplus,
        False
    )

    return out, x, du, ddelta, dA, dB, dC, dD, ddelta_bias

def setseed(seed = 0):
    import torch
    import numpy as np 
    import random
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)

    if True: 
        torch.backends.cudnn.enabled = True
        torch.backends.cudnn.benchmark = True
        torch.backends.cudnn.deterministic = True

B, C, H, W = 128, 96, 56, 56 # dB, dC, dD, ddelta_bias not consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(819.5000, device='cuda:0') tensor(4., device='cuda:0')
tensor(307.8750, device='cuda:0') tensor(1., device='cuda:0')
tensor(1.9688, device='cuda:0') tensor(0.0625, device='cuda:0')
tensor(1.0391, device='cuda:0') tensor(0.0469, device='cuda:0')
"""
B, C, H, W = 2, 96, 56, 56 # dB, dC not consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(89.7734, device='cuda:0') tensor(0.1250, device='cuda:0')
tensor(29.1208, device='cuda:0') tensor(0.0312, device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
"""
B, C, H, W = 128, 2, 56, 56 # dD, ddelta_bias not consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0.0312, device='cuda:0') tensor(0.0312, device='cuda:0')
tensor(0.0156, device='cuda:0') tensor(0.0156, device='cuda:0')
"""
B, C, H, W = 2, 2, 56, 56 # all consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
"""
setseed(0)
im1 = torch.randn((B, C, H, W)).cuda().requires_grad_()
out1 = s6fb(im1, backend="mamba")

setseed(0)
im2 = torch.randn((B, C, H, W)).cuda().requires_grad_()
out2 = s6fb(im2, backend="mamba")

for o1, o2 in zip(out1, out2):
    print((o1 - o2).abs().sum(), (o1 - o2).abs().max())

ChenJunhao-Fighting commented 3 months ago

> I did a small experiment which shows that the difference may come from the gradient calculation for B, C, D, and delta_bias. [...] dA is not affected and stays consistent across runs.

Thank you very much for your excellent work and response. Is there any way you can solve the inconsistency?

Lyan-ing commented 20 hours ago

> I did a small experiment which shows that the difference may come from the gradient calculation for B, C, D, and delta_bias. [...] dA is not affected and stays consistent across runs.

I also ran into this problem. Can the non-determinism in the CUDA operator be fixed? In my experiments, the results differ quite noticeably from run to run.