MzeroMiko / VMamba

VMamba: Visual State Space Models; code is based on Mamba.
MIT License

About Seed #257

zs1314 opened this issue 2 weeks ago

zs1314 commented 2 weeks ago

Hello, sorry to bother you; your work is excellent! However, in my experiments I found that even after fixing the random seed, training results are still inconsistent. In this case, how can I determine whether my improvement is effective? I have already ruled out environmental problems: other models I tested reproduce exactly once the seed is fixed.

MzeroMiko commented 2 weeks ago

What do you mean by "inconsistent training results"? Does the performance differ from what we reported, or does it differ across multiple experiments you ran with the same config and seed?

zs1314 commented 2 weeks ago

@MzeroMiko It differs across multiple experiments we ran with the same config and seed.

MzeroMiko commented 2 weeks ago

Here's a trick: run N iterations of training, then print torch.randn((1,)) to see whether the random seed has been consumed the same way in every experiment. The number it prints should be identical across runs.
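Something like this (a minimal sketch; `model`, `optimizer`, and `loader` are placeholders for your own training objects, not VMamba's training code):

```python
import torch

def rng_probe(model, optimizer, loader, n_iters=10):
    # Train for a few iterations, then draw one number from the global RNG.
    # If two runs with the same seed print different values here, the seed
    # is being consumed differently somewhere inside those iterations.
    batches = iter(loader)
    for _ in range(n_iters):
        x, y = next(batches)
        loss = torch.nn.functional.cross_entropy(model(x.cuda()), y.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(torch.randn((1,)))  # should be identical across identically seeded runs
```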

zs1314 commented 2 weeks ago

@MzeroMiko Thank you for your answer! I've actually tried that. I wonder whether some operation in VSSM cannot be made deterministic by fixing the seed.

MzeroMiko commented 2 weeks ago

At runtime there should be no operation that can "eat" random numbers; even the dataloader cannot do that except during initialization or re-initialization. So if the random number you print is not the same, something is wrong.
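If you want to check this more directly than by sampling, you can compare the RNG state before and after a call (a small sketch using PyTorch's RNG-state API):

```python
import torch

def consumes_rng(fn, *args):
    # True if calling fn(*args) consumed any CPU or CUDA random numbers.
    cpu_before = torch.get_rng_state()
    cuda_before = torch.cuda.get_rng_state()
    fn(*args)
    return not (torch.equal(cpu_before, torch.get_rng_state())
                and torch.equal(cuda_before, torch.cuda.get_rng_state()))
```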

zs1314 commented 2 weeks ago

@MzeroMiko I've tried printing the random number and it is the same every time, which means my seed fix is working. But the forward and backward passes of Mamba seem to differ from run to run! Did you experiment with a fixed seed, and if so, how did you do it? Or did you average over multiple experiments?

zs1314 commented 2 weeks ago

@MzeroMiko Perhaps you could run one epoch of training after fixing the seed, and check whether the train loss, val loss, and val acc are identical across two trainings with the same seed.

MzeroMiko commented 2 weeks ago

Actually there is a file called vmamba_check.py which contains code for checking the difference between different versions of the code. It can easily be modified to check whether running VSSM gives the same result across time.
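A stripped-down version of that kind of check might look like this (a sketch only: the import path and VSSM constructor arguments are assumptions, not a copy of vmamba_check.py):

```python
import torch
from models.vmamba import VSSM  # assumed import path; adjust to your checkout

def check_repeatable(seed=0):
    # Rebuild the model and input from the same seed twice and compare
    # the forward output and the input gradient across the two runs.
    results = []
    for _ in range(2):
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        model = VSSM().cuda().train()  # constructor arguments assumed
        x = torch.randn(2, 3, 224, 224).cuda().requires_grad_()
        y = model(x).sum()
        y.backward()
        results.append((y.detach().clone(), x.grad.detach().clone()))
    (y1, g1), (y2, g2) = results
    print((y1 - y2).abs().max(), (g1 - g2).abs().max())  # nonzero => nondeterministic
```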

MzeroMiko commented 2 weeks ago

> @MzeroMiko Perhaps you could run one epoch of training after fixing the seed, and check whether the train loss, val loss, and val acc are identical across two trainings with the same seed.

So have you done this experiment? What is the conclusion?

zs1314 commented 2 weeks ago

> @MzeroMiko Perhaps you could run one epoch of training after fixing the seed, and check whether the train loss, val loss, and val acc are identical across two trainings with the same seed.
>
> So have you done this experiment? What is the conclusion?

@MzeroMiko Unfortunately, I tried, and the results were inconsistent between the two runs. I have also tested other models (ConvNeXt and ViT) and they are consistent. My seed function:

```python
def seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.enabled = True
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':16:8'
    # torch.use_deterministic_algorithms(True)
```

zs1314 commented 2 weeks ago

@MzeroMiko New discovery: I've found the same thing happening with other visual Mamba variants, like PlainMamba.

JHChen1 commented 1 week ago

@MzeroMiko @zs1314 Hi, I also encountered this problem. Have you solved it?

zs1314 commented 1 week ago

No! I just average the results over multiple experiments.


JHChen1 commented 1 week ago

> No! I just average the results over multiple experiments.

Thank you for your reply!

MzeroMiko commented 1 week ago

Thank you all for your findings. I will try to find the cause in the coming days, once a machine is available.

MzeroMiko commented 3 days ago

I did a small experiment which shows that the difference may come from the gradient calculation of B, C, D, and delta_bias. For more details, see selective_scan_bwd_kernel_oflex.cuh or selective_scan_bwd_kernel.cuh. My guess is that the cause is the nondeterministic order of additions across thread blocks when using gpuAtomicAdd: floating-point addition is not associative, so (float(a) + float(b)) + float(c) does not always equal float(a) + (float(b) + float(c)).

However, although gpuAtomicAdd is also used when calculating the gradient of A, it is interesting that dA is not affected and stays consistent across every run.
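The non-associativity itself is easy to demonstrate in isolation (a minimal example, unrelated to the kernels):

```python
import torch

# Sum the same float32 values in two different orders. Because float
# addition is not associative, the reduction order changes the rounding.
x = torch.randn(1_000_000, dtype=torch.float32)
s_fwd = x.sum()
s_rev = x.flip(0).sum()
print(s_fwd.item(), s_rev.item(), (s_fwd - s_rev).abs().item())  # usually nonzero
```

The script I used for the check itself: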

```python
import torch
from models.csms6s import selective_scan_cuda_oflex, selective_scan_cuda

def s6fb(x: torch.Tensor, delta_softplus=True, oflex=True, backend=None):
    # x: (B, C, H, W); flattened to (B, C, L) below
    x = x.flatten(2, 3)
    N = 1  # state dimension
    K = 1  # number of scan directions
    B, KC, L = x.shape
    C = KC // K
    # Build deterministic synthetic scan inputs from x (note: the names
    # B and C are reused below for the scan's B and C tensors).
    u = x
    delta = x.sigmoid().view(B, KC, L)
    A = -x.sigmoid().sum(0)[:, :N].view(KC, N)
    B = x.sigmoid().view(B, K, C, L)[:, :, :N, :]
    C = B + 1
    D = x.tanh().sum(0).sum(-1).view(KC)
    delta_bias = delta.sum(0).sum(-1).view(KC)

    # out, x, *rest = selective_scan_cuda_oflex.fwd(u, delta, A, B, C, D, delta_bias, delta_softplus, 1, oflex)
    out, x, *rest = selective_scan_cuda.fwd(u, delta, A, B, C, D, None, delta_bias, delta_softplus)

    dout = out.sigmoid()
    # du, ddelta, dA, dB, dC, dD, ddelta_bias, *rest = selective_scan_cuda_oflex.bwd(
    #     u, delta, A, B, C, D, delta_bias, dout, x, delta_softplus, 1
    # )

    du, ddelta, dA, dB, dC, dD, ddelta_bias, *rest = selective_scan_cuda.bwd(
        u, delta, A, B, C, D, None, delta_bias, dout, x, None, None, delta_softplus,
        False
    )

    return out, x, du, ddelta, dA, dB, dC, dD, ddelta_bias

def setseed(seed = 0):
    import torch
    import numpy as np 
    import random
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)

    if True: 
        torch.backends.cudnn.enabled = True
        torch.backends.cudnn.benchmark = True
        torch.backends.cudnn.deterministic = True

# Each (B, C, H, W) setting below was run separately; the quoted strings
# record the printed (|diff|.sum(), |diff|.max()) for each returned tensor.
B, C, H, W = 128, 96, 56, 56 # dB, dC, dD, ddelta_bias not consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(819.5000, device='cuda:0') tensor(4., device='cuda:0')
tensor(307.8750, device='cuda:0') tensor(1., device='cuda:0')
tensor(1.9688, device='cuda:0') tensor(0.0625, device='cuda:0')
tensor(1.0391, device='cuda:0') tensor(0.0469, device='cuda:0')
"""
B, C, H, W = 2, 96, 56, 56 # dB, dC not consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(89.7734, device='cuda:0') tensor(0.1250, device='cuda:0')
tensor(29.1208, device='cuda:0') tensor(0.0312, device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
"""
B, C, H, W = 128, 2, 56, 56 # dD, ddelta_bias not consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0.0312, device='cuda:0') tensor(0.0312, device='cuda:0')
tensor(0.0156, device='cuda:0') tensor(0.0156, device='cuda:0')
"""
B, C, H, W = 2, 2, 56, 56 # all consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
"""
setseed(0)
im1 = torch.randn((B, C, H, W)).cuda().requires_grad_()
out1 = s6fb(im1, backend="mamba")

setseed(0)
im2 = torch.randn((B, C, H, W)).cuda().requires_grad_()
out2 = s6fb(im2, backend="mamba")

for o1, o2 in zip(out1, out2):
    print((o1 - o2).abs().sum(), (o1 - o2).abs().max())
```