chinakook opened this issue 3 years ago
@chinakook thanks for reporting. how did you produce these results?
I will do more tests, and then I will paste the test code here.
The official build of mxnet_cu110-1.9.0b20201226 is good. I'll do more tests to find the reason.
The result also varies with mxnet_cu110-2.0.0b20201226. A minimal test case to reproduce it:
```python
import os
# os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
import mxnet as mx
import numpy as np
from mxnet.gluon.model_zoo.vision.resnet import resnet18_v1

def testrestnet():
    ctx = mx.gpu(0)
    mx_model = resnet18_v1(pretrained=False)
    mx_model.hybridize()
    mx.random.seed(22)
    mx_model.initialize()
    mx_model.reset_ctx(ctx=ctx)
    np.random.seed(115)
    x = np.random.uniform(size=(1, 3, 224, 224)).astype(np.float32)
    x_mx = mx.nd.array(x, ctx=ctx)
    y_mx = mx_model(x_mx)
    # the res is -1219.706 on RTX3090 with MXNET_CUDNN_AUTOTUNE_DEFAULT=0
    # the res varies on RTX3090 without MXNET_CUDNN_AUTOTUNE_DEFAULT=0:
    #   -1219.7754, -1220.0055, -1220.0052, -1220.0051
    # the res is -1220.0052 on RTX2080Ti
    # the res is -1220.0062 on CPU
    res = y_mx.asnumpy().sum()
    print(res)

if __name__ == '__main__':
    testrestnet()
```
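For a cross-device check, the same weights and input can be evaluated on both CPU and GPU and the full output tensors compared, not just their sums. A minimal sketch based on the code above (the tolerances are arbitrary choices, not from the original report):

```python
import mxnet as mx
import numpy as np
from mxnet.gluon.model_zoo.vision.resnet import resnet18_v1

mx.random.seed(22)
model = resnet18_v1(pretrained=False)
model.initialize()   # parameters start on CPU
model.hybridize()

np.random.seed(115)
x = np.random.uniform(size=(1, 3, 224, 224)).astype(np.float32)

# Forward pass on CPU with the freshly initialized weights.
y_cpu = model(mx.nd.array(x, ctx=mx.cpu())).asnumpy()

# Move the same parameters to the GPU and repeat the forward pass.
model.reset_ctx(mx.gpu(0))
y_gpu = model(mx.nd.array(x, ctx=mx.gpu(0))).asnumpy()

print(np.abs(y_cpu - y_gpu).max())
print(np.allclose(y_cpu, y_gpu, rtol=1e-3, atol=1e-5))
```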
After more tests, I found that the result also varies on the RTX2080Ti, on both MXNet 1.9.0 and MXNet 2.0.0.
The result differs by about 0.005 already in the shallow layers; I expect the difference to grow as the network gets deeper.
```python
import os
# os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
import mxnet as mx
import numpy as np
from mxnet.gluon.model_zoo.vision.resnet import resnet18_v1

def testrestnet():
    ctx = mx.gpu(0)
    mx_model = resnet18_v1(pretrained=True, ctx=ctx)
    mx_model.hybridize()
    x_mx = mx.nd.ones(shape=(1, 3, 224, 224), ctx=ctx)
    y_mx = mx_model.features[0:6](x_mx)
    # the res is always 13064.977 on CPU
    # the res varies on RTX2080Ti/RTX3090 on both MXNet 1.9.0 and 2.0.0 without
    # MXNET_CUDNN_AUTOTUNE_DEFAULT=0: 13064.971, 13064.976
    res = y_mx.asnumpy().sum()
    print(res)

if __name__ == '__main__':
    testrestnet()
```
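To put that spread in perspective, a quick calculation with the sums quoted in the comments above shows the shallow-layer difference is only a few times float32 machine epsilon in relative terms (values taken from the comments; the comparison itself is just a sketch):

```python
import numpy as np

# Sums reported in the comments above for the first few layers.
vals = np.array([13064.971, 13064.976, 13064.977], dtype=np.float64)

rel_spread = (vals.max() - vals.min()) / vals.mean()
print(rel_spread)                 # ~4.6e-7
print(np.finfo(np.float32).eps)   # ~1.19e-7
```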
A Torch test case. Torch shows a smaller difference between the 2080Ti and the 3090; MXNet on the RTX 3090, however, can differ by up to 0.3 in some cases.
```python
import torch
import torchvision as tv

torch.backends.cudnn.benchmark = True

model = tv.models.resnet18(pretrained=True)
model.cuda(0)
model.eval()

# y is always 948.1921 on CPU
# y is always 948.1919 on RTX2080Ti whether cudnn.benchmark is True or False
# y is 948.19165 on RTX3090 when cudnn.benchmark=False
# y varies on RTX3090 when cudnn.benchmark=True: 948.19147, 948.1919
x = torch.ones(1, 3, 224, 224).cuda(0)
y = model(x)
y = y.abs().sum()
print(y.detach().cpu().numpy())
```
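For reference, the PyTorch-side analogue of `MXNET_CUDNN_AUTOTUNE_DEFAULT=0` is to turn off benchmarking and request deterministic cuDNN kernels; a minimal sketch (this only pins the algorithm choice on one GPU, it does not remove hardware differences):

```python
import torch

# Disable cuDNN autotuning and ask for deterministic convolution algorithms,
# so repeated runs on the same GPU pick the same kernels.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```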
> After more tests, I found that the result also varies on RTX2080Ti on both MXNet 1.9.0 and MXNet 2.0.0. ~~The result have 0.005 difference in the shallow layer. I think it will have more difference as the layer grows.~~
Have you ever tried `NVIDIA_TF32_OVERRIDE=0 python`? The 3090 uses TF32 to accelerate training and testing by default, and setting `NVIDIA_TF32_OVERRIDE=0` disables it.
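If setting the variable on the command line is inconvenient, it can also be exported from Python; a minimal sketch, assuming it has to be set before the CUDA libraries are loaded (i.e. before importing mxnet):

```python
import os

# Assumption: must be set before any CUDA context is created, so do it
# before importing mxnet; setting it later may have no effect.
os.environ['NVIDIA_TF32_OVERRIDE'] = '0'

import mxnet as mx
```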
@Neutron3529 I don't think it has anything to do with TF32. I've tested with `NVIDIA_TF32_OVERRIDE=0` as you suggested, but the problem is not solved.
> @Neutron3529 I think It has nothing to do with tf32. I've tested with `NVIDIA_TF32_OVERRIDE=0` as you suggested, the problem is not solved.

my result (v1.x, compiled by myself):

```
>> import os
>> os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
>> import mxnet as mx
>> import numpy as np
>> from mxnet.gluon.model_zoo.vision.resnet import resnet18_v1
>> def testrestnet(ctx=mx.gpu(0)):
       mx_model = resnet18_v1(pretrained=True,ctx=ctx)
       mx_model.hybridize()
       x_mx = mx.nd.ones(shape=(1,3,224,224), ctx=ctx)
       y_mx = mx_model.features[0:6](x_mx)
       res = y_mx.asnumpy().sum()
       print(res)
...
>> testrestnet(mx.cpu())
Downloading /me/mxnet/models/resnet18_v1-a0666292.zipa165046a-afde-4d5a-a034-0163a93f6047 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
13064.974
>> testrestnet(mx.cpu())
13064.974
>> testrestnet(mx.cpu())
13064.974
>> testrestnet(mx.gpu())
13064.976
>> testrestnet(mx.gpu())
13064.976
```

using mxnet without TF_OVERRIDE:

```
>> import os
>> os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
>> import mxnet as mx
>> import numpy as np
>> from mxnet.gluon.model_zoo.vision.resnet import resnet18_v1
>> def testrestnet(ctx=mx.gpu(0)):
       mx_model = resnet18_v1(pretrained=True,ctx=ctx)
       mx_model.hybridize()
       x_mx = mx.nd.ones(shape=(1,3,224,224), ctx=ctx)
       y_mx = mx_model.features[0:6](x_mx)
       res = y_mx.asnumpy().sum()
       print(res)
...
>> testrestnet()
13065.814
>> testrestnet(mx.cpu())
13064.974
```
It seems that `NVIDIA_TF32_OVERRIDE=0` works for me, and running without it can introduce a large bias.
What's more, my CPU generated a different result (13064.974 vs 13064.977) compared to yours, so maybe the error is normal and not worth an issue.
@Neutron3529 Yes, v1.x is OK; MXNet 2 has this bug. I'll do more tests.
TF32 has been on by default since MXNet 1.8, while PyTorch may have TF32 off by default.
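For reference, PyTorch (since 1.7) exposes its TF32 behaviour through two flags, so the actual default of a given install can be checked directly; a minimal sketch:

```python
import torch

# Current TF32 settings (PyTorch >= 1.7); the defaults depend on the version.
print(torch.backends.cuda.matmul.allow_tf32)  # TF32 in cuBLAS matmuls
print(torch.backends.cudnn.allow_tf32)        # TF32 in cuDNN convolutions

# Force full float32 precision for an apples-to-apples comparison with MXNet.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```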
I built mxnet with CUDA 11.1 myself, but I found significant differences between the RTX 2080Ti result and the RTX 3090 result: I cannot get np.allclose with rtol=0.001 to return True once the RTX 3090 result is involved (see the sketch after the result listings below). I tested with resnet18_v1 (modified to match torchvision); the results are as follows. MXNet 2.0 on RTX 3090 result:
MXNet 2.0 on RTX 2080Ti:
MXNet 2.0 on CPU:
Torch 1.7 on RTX 3090
Torch 1.7 on RTX 2080Ti
Torch 1.7 on CPU
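A minimal sketch of the np.allclose check described above, assuming the per-device outputs have been dumped to .npy files first (the file names and the save/load step are placeholders, not part of the original report):

```python
import numpy as np

# Hypothetical files: outputs of the same model and input saved on each device.
y_3090 = np.load('y_rtx3090.npy')
y_2080ti = np.load('y_rtx2080ti.npy')

# The check that fails once the RTX 3090 result is involved.
print(np.allclose(y_3090, y_2080ti, rtol=0.001))

# Largest relative difference, for reference.
print(np.max(np.abs(y_3090 - y_2080ti) / (np.abs(y_2080ti) + 1e-12)))
```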