Open zs1314 opened 4 months ago
What do you mean by “inconsistent training results”? Does the performance differ from what we reported, or does it differ across multiple experiments you ran with the same config and seed?
@MzeroMiko It differs across multiple experiments we ran with the same config and seed.
Here's a trick:
You can run N iters of training and then print out `torch.randn((1,))` to see whether the random seed is consumed the same way in every experiment. The number it prints should be the same across runs.
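A minimal sketch of that check (the loader and step function names here are placeholders, not from this repo):

```python
import torch

N = 100  # any fixed iteration count

for it, batch in enumerate(train_loader):  # `train_loader`: your existing dataloader
    train_one_iter(batch)                  # `train_one_iter`: your existing training step
    if it == N:
        # If the RNG was consumed identically in every run, this prints the same value.
        print(torch.randn((1,)))
        break
```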
@MzeroMiko Thank you for your answer! I've actually tried it! I wonder if there is some operation in VSSM whose behavior can't be fixed just by seeding.
At runtime, there should be no operation that can "eat" the random number; even the dataloader cannot do that except at initialization or re-initialization. So if the random number you print is not the same, something must be wrong.
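For example (a sketch, not from the original comment), a shuffling DataLoader only draws from the global RNG when its iterator is (re)created, not per batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(10).float())
torch.manual_seed(0)
loader = DataLoader(ds, batch_size=2, shuffle=True)

_ = iter(loader)       # the epoch permutation / base seed is drawn from the global RNG here
print(torch.randn(1))  # this value reflects exactly how much RNG state was consumed above
```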
@MzeroMiko I've tried printing the random number and it is the same every time, which means my seed fix is working. But the forward and backward passes of Mamba still seem to differ between runs! Did you experiment with a fixed seed, and if so, how did you do it? Or did you average over multiple experiments?
@MzeroMiko Perhaps you can run one epoch of training after fixing the seed to see whether the train loss, val loss, and val acc are the same across two runs under the same seed.
Actually there's a file called vmamba_check.py which contains code for checking the difference between different versions of the code. That can be easily modified to check whether the output of VSSM changes or stays the same across runs.
@MzeroMiko Perhaps you can run one epoch of training after fixing the seed to see whether the train loss, val loss, and val acc are the same across two runs under the same seed.
So have you done this experiment? What is the conclusion?
@MzeroMiko Unfortunately, I tried, and the results were inconsistent between the two runs. Also, I have tested other models such as ConvNeXt and ViT, and they are consistent.
```python
import os
import random

import numpy as np
import torch

def seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.enabled = True
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
    # torch.use_deterministic_algorithms(True)
```
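As a side note (a sketch of the standard PyTorch determinism switches, not something confirmed to fix this issue): the commented-out `torch.use_deterministic_algorithms(True)` line makes built-in PyTorch ops raise an error when they have no deterministic implementation, but it does not cover custom CUDA extensions such as the selective-scan kernel.

```python
import os
import torch

# Must be set before the first cuBLAS call for deterministic cuBLAS behavior.
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

# Built-in ops without a deterministic implementation will now raise a RuntimeError,
# which can help locate nondeterministic operators. Custom extensions (e.g. the
# selective-scan CUDA kernels) are not affected by this switch.
torch.use_deterministic_algorithms(True)
```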
@MzeroMiko New discovery: I've found the same thing happening with other visual Mamba models, like PlainMamba, etc.
@MzeroMiko @zs1314 Hi, I also encountered this problem, have you solved it?
No! I only take the average result over multiple experiments.
Thank you for your reply!
Thank you all for your findings. I will try to find out the reason in the coming days, once the machine is available.
I did a small experiment which shows that the difference may come from the gradient calculation of `B, C, D, delta_bias`. For more details, you can refer to `selective_scan_bwd_kernel_oflex.cuh` or `selective_scan_bwd_kernel.cuh`. I guess the reason is the random order of accumulation across different `Blocks` when using `gpuAtomicAdd`: floating-point addition is not associative, so summing the same values in a different order can give a different result.

However, though `gpuAtomicAdd` is also used when calculating the gradient of `A`, it is interesting that `dA` is not influenced and stays consistent across every run.
```python
import torch
from models.csms6s import selective_scan_cuda_oflex, selective_scan_cuda


def s6fb(x: torch.Tensor, delta_softplus=True, oflex=True, backend=None):
    # x: B, k*C, L
    x = x.flatten(2, 3)
    N = 1
    K = 1
    B, KC, L = x.shape
    C = KC // K
    u = x
    delta = x.sigmoid().view(B, KC, L)
    A = -x.sigmoid().sum(0)[:, :N].view(KC, N)
    B = x.sigmoid().view(B, K, C, L)[:, :, :N, :]
    C = B + 1
    D = x.tanh().sum(0).sum(-1).view(KC)
    delta_bias = delta.sum(0).sum(-1).view(KC)
    # out, x, *rest = selective_scan_cuda_oflex.fwd(u, delta, A, B, C, D, delta_bias, delta_softplus, 1, oflex)
    out, x, *rest = selective_scan_cuda.fwd(u, delta, A, B, C, D, None, delta_bias, delta_softplus)
    dout = out.sigmoid()
    # du, ddelta, dA, dB, dC, dD, ddelta_bias, *rest = selective_scan_cuda_oflex.bwd(
    #     u, delta, A, B, C, D, delta_bias, dout, x, delta_softplus, 1
    # )
    du, ddelta, dA, dB, dC, dD, ddelta_bias, *rest = selective_scan_cuda.bwd(
        u, delta, A, B, C, D, None, delta_bias, dout, x, None, None, delta_softplus,
        False
    )
    return out, x, du, ddelta, dA, dB, dC, dD, ddelta_bias


def setseed(seed=0):
    import torch
    import numpy as np
    import random
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    if True:
        torch.backends.cudnn.enabled = True
        torch.backends.cudnn.benchmark = True
        torch.backends.cudnn.deterministic = True


B, C, H, W = 128, 96, 56, 56  # dB, dC, dD, ddelta_bias not consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(819.5000, device='cuda:0') tensor(4., device='cuda:0')
tensor(307.8750, device='cuda:0') tensor(1., device='cuda:0')
tensor(1.9688, device='cuda:0') tensor(0.0625, device='cuda:0')
tensor(1.0391, device='cuda:0') tensor(0.0469, device='cuda:0')
"""
B, C, H, W = 2, 96, 56, 56  # dB, dC not consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(89.7734, device='cuda:0') tensor(0.1250, device='cuda:0')
tensor(29.1208, device='cuda:0') tensor(0.0312, device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
"""
B, C, H, W = 128, 2, 56, 56  # dD, ddelta_bias not consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0.0312, device='cuda:0') tensor(0.0312, device='cuda:0')
tensor(0.0156, device='cuda:0') tensor(0.0156, device='cuda:0')
"""
B, C, H, W = 2, 2, 56, 56  # all consistent
"""
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
tensor(0., device='cuda:0') tensor(0., device='cuda:0')
"""

setseed(0)
im1 = torch.randn((B, C, H, W)).cuda().requires_grad_()
out1 = s6fb(im1, backend="mamba")
setseed(0)
im2 = torch.randn((B, C, H, W)).cuda().requires_grad_()
out2 = s6fb(im2, backend="mamba")
for o1, o2 in zip(out1, out2):
    print((o1 - o2).abs().sum(), (o1 - o2).abs().max())
```
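A minimal illustration of why accumulation order matters (this sketch is mine, not part of the experiment above): floating-point addition is not associative, so when many CUDA blocks accumulate partial gradients into the same location with `gpuAtomicAdd`, the scheduling order can change the final bits.

```python
import torch

# Pure floating-point effect: the same three numbers, summed in different orders.
a, b, c = 1e20, -1e20, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0

# A GPU reduction that uses atomics may likewise differ bitwise between runs.
# (Whether a difference actually appears depends on the PyTorch version, GPU,
# and tensor sizes; large float32 reductions make it more likely.)
if torch.cuda.is_available():
    x = torch.randn(1_000_000, device="cuda")
    idx = torch.zeros(1_000_000, dtype=torch.long, device="cuda")
    r1 = torch.zeros(1, device="cuda").index_add_(0, idx, x)
    r2 = torch.zeros(1, device="cuda").index_add_(0, idx, x)
    print((r1 - r2).abs().item())
```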
Thank you very much for your excellent work and response. Is there any way to resolve the inconsistency?
I also found this problem. Can the nondeterminism in the CUDA operator be fixed? In my experiments the results differ noticeably between runs.
Hello, I'm sorry to bother you! Your work is excellent! However, when I ran my experiments, I found that even after fixing the random seeds I still get inconsistent training results. In this case, how can I determine whether my improvement is effective? I have also ruled out other environmental problems: I tested other models, and they are fully reproducible after fixing the seed.