apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Results are significantly different between RTX 2080Ti and RTX 3090 #19649

Open chinakook opened 3 years ago

chinakook commented 3 years ago

I built MXNet with CUDA 11.1 myself, but I found a significant difference between the results on an RTX 2080Ti and an RTX 3090. With the RTX 3090 output, I cannot get np.allclose with rtol=0.001 to return True. I tested with resnet18_v1 (modified to match torchvision); the results are as follows. MXNet 2.0 on RTX 3090 result:

   1.41843593e+00 -6.14944875e-01 -1.21827471e+00  1.47419822e+00
   1.08697571e-01 -1.53987074e+00 -2.19901204e-01  9.48053539e-01
   9.75863874e-01  1.70030773e+00  8.14817071e-01 -1.23302710e+00
   1.59906292e+00  6.93061709e-01 -1.53004932e+00 -1.63886517e-01
  -7.90785626e-02  2.69093782e-01 -6.79612219e-01  1.62834823e-01
   1.30419743e+00  3.55334133e-01  3.44635278e-01 -1.63632333e+00
  -1.83135128e+00 -2.71486902e+00 -1.90834343e+00 -1.56557214e+00
  -2.34904575e+00 -8.75294745e-01 -1.45051964e-02  2.31601214e+00

MXNet 2.0 on RTX 2080Ti:

   1.41812360e+00 -6.14904046e-01 -1.21819317e+00  1.47481441e+00
   1.08835481e-01 -1.53912461e+00 -2.19649583e-01  9.48446751e-01
   9.76122081e-01  1.70034432e+00  8.15561593e-01 -1.23293436e+00
   1.59933698e+00  6.92907929e-01 -1.53025842e+00 -1.63300186e-01
  -7.87981078e-02  2.69501388e-01 -6.79563940e-01  1.62799448e-01
   1.30361092e+00  3.54955167e-01  3.44287097e-01 -1.63627052e+00
  -1.83101940e+00 -2.71485949e+00 -1.90862203e+00 -1.56534243e+00
  -2.34861779e+00 -8.75208437e-01 -1.46625079e-02  2.31575775e+00

MXNet 2.0 on CPU:

   1.41812253e+00 -6.14904225e-01 -1.21819282e+00  1.47481489e+00
   1.08835749e-01 -1.53912461e+00 -2.19649076e-01  9.48446691e-01
   9.76122200e-01  1.70034420e+00  8.15561354e-01 -1.23293459e+00
   1.59933639e+00  6.92908108e-01 -1.53025806e+00 -1.63299382e-01
  -7.87984356e-02  2.69500166e-01 -6.79564059e-01  1.62798852e-01
   1.30361056e+00  3.54956239e-01  3.44287276e-01 -1.63627028e+00
  -1.83101881e+00 -2.71485925e+00 -1.90862203e+00 -1.56534243e+00
  -2.34861803e+00 -8.75208795e-01 -1.46625564e-02  2.31575823e+00

Torch 1.7 on RTX 3090

   1.41812313e+00 -6.14903867e-01 -1.21819305e+00  1.47481418e+00
   1.08835526e-01 -1.53912425e+00 -2.19649911e-01  9.48446572e-01
   9.76122499e-01  1.70034397e+00  8.15561354e-01 -1.23293447e+00
   1.59933650e+00  6.92907453e-01 -1.53025746e+00 -1.63299173e-01
  -7.87977725e-02  2.69501239e-01 -6.79563761e-01  1.62798911e-01
   1.30361116e+00  3.54956120e-01  3.44288558e-01 -1.63627124e+00
  -1.83101881e+00 -2.71485972e+00 -1.90862191e+00 -1.56534243e+00
  -2.34861827e+00 -8.75208020e-01 -1.46627314e-02  2.31575871e+00

Torch 1.7 on RTX 2080Ti

   1.41812313e+00 -6.14903808e-01 -1.21819329e+00  1.47481418e+00
   1.08835645e-01 -1.53912401e+00 -2.19649911e-01  9.48446631e-01
   9.76122320e-01  1.70034397e+00  8.15561354e-01 -1.23293436e+00
   1.59933674e+00  6.92907691e-01 -1.53025758e+00 -1.63299263e-01
  -7.87977204e-02  2.69501120e-01 -6.79563642e-01  1.62798882e-01
   1.30361140e+00  3.54956120e-01  3.44288498e-01 -1.63627124e+00
  -1.83101892e+00 -2.71485972e+00 -1.90862191e+00 -1.56534243e+00
  -2.34861827e+00 -8.75208080e-01 -1.46627687e-02  2.31575847e+00

Torch 1.7 on CPU

   1.41812289e+00 -6.14903927e-01 -1.21819293e+00  1.47481537e+00
   1.08835876e-01 -1.53912544e+00 -2.19649374e-01  9.48445439e-01
   9.76121962e-01  1.70034182e+00  8.15560222e-01 -1.23293531e+00
   1.59933591e+00  6.92908287e-01 -1.53025782e+00 -1.63299412e-01
  -7.87980631e-02  2.69500971e-01 -6.79563344e-01  1.62798971e-01
   1.30361187e+00  3.54957968e-01  3.44287753e-01 -1.63626969e+00
  -1.83101833e+00 -2.71485972e+00 -1.90862167e+00 -1.56534195e+00
  -2.34861779e+00 -8.75208795e-01 -1.46625713e-02  2.31575823e+00
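The tolerance failure can be seen directly by running np.allclose on a few of the values above. A minimal sketch, using the second row of the three MXNet listings (the array names are just labels for this illustration):

```python
import numpy as np

# Second row of the MXNet 2.0 listings above.
cpu     = np.array([1.08835749e-01, -1.53912461e+00, -2.19649076e-01, 9.48446691e-01], dtype=np.float32)
rtx2080 = np.array([1.08835481e-01, -1.53912461e+00, -2.19649583e-01, 9.48446751e-01], dtype=np.float32)
rtx3090 = np.array([1.08697571e-01, -1.53987074e+00, -2.19901204e-01, 9.48053539e-01], dtype=np.float32)

# The 2080Ti agrees with the CPU baseline within rtol=0.001 ...
print(np.allclose(cpu, rtx2080, rtol=1e-3))  # True
# ... but the 3090 already misses it on the first element:
# |0.10883575 - 0.10869757| ~= 1.4e-4 > 1e-3 * 0.1087
print(np.allclose(cpu, rtx3090, rtol=1e-3))  # False
```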
szha commented 3 years ago

@chinakook thanks for reporting. how did you produce these results?

chinakook commented 3 years ago

> @chinakook thanks for reporting. how did you produce these results?

I will do more tests, and then I will paste the test code here.

chinakook commented 3 years ago

The official mxnet_cu110-1.9.0b20201226 build is good. I'll do more tests to find the reason.

chinakook commented 3 years ago

The result also varies with mxnet_cu110-2.0.0b20201226. A minimal test case to reproduce it:

import os
# os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
import mxnet as mx
import numpy as np
from mxnet.gluon.model_zoo.vision.resnet import resnet18_v1

def testrestnet():
    ctx = mx.gpu(0)
    mx_model = resnet18_v1(pretrained=False)
    mx_model.hybridize()
    mx.random.seed(22)
    mx_model.initialize()

    mx_model.reset_ctx(ctx=ctx)

    np.random.seed(115)
    x = np.random.uniform(size=(1,3,224,224)).astype(np.float32)

    x_mx = mx.nd.array(x, ctx=ctx)

    y_mx = mx_model(x_mx)

    # the res is -1219.706 on RTX3090 with MXNET_CUDNN_AUTOTUNE_DEFAULT=0
    # the res varies on RTX3090 GPU without MXNET_CUDNN_AUTOTUNE_DEFAULT=0: -1219.7754, -1220.0055, -1220.0052, -1220.0051
    # the res is -1220.0052 on RTX2080Ti
    # the res is -1220.0062 on CPU
    res = y_mx.asnumpy().sum()

    print(res)

if __name__ == '__main__':
    testrestnet()
chinakook commented 3 years ago

After more tests, I found that the result also varies on the RTX 2080Ti with both MXNet 1.9.0 and MXNet 2.0.0. The result differs by 0.005 even in a shallow layer, and I think the difference will grow with depth.

import os
# os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
import mxnet as mx
import numpy as np
from mxnet.gluon.model_zoo.vision.resnet import resnet18_v1

def testrestnet():
    ctx = mx.gpu(0)
    mx_model = resnet18_v1(pretrained=True,ctx=ctx)
    mx_model.hybridize()

    x_mx = mx.nd.ones(shape=(1,3,224,224), ctx=ctx)

    y_mx = mx_model.features[0:6](x_mx)

    # the res is always 13064.977 on CPU
    # the res varies on RTX2080Ti/RTX3090 on both MXNet 1.9.0 and 2.0.0 without 
    # MXNET_CUDNN_AUTOTUNE_DEFAULT=0: 13064.971, 13064.976
    res = y_mx.asnumpy().sum()

    print(res)

if __name__ == '__main__':
    testrestnet()
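Part of this run-to-run variation is expected: float32 addition is not associative, so two kernels that merely sum in a different order (e.g. different convolution algorithms picked by autotune) produce slightly different results. A small NumPy illustration of the effect, independent of MXNet:

```python
import numpy as np

rng = np.random.default_rng(22)
x = rng.uniform(size=1_000_000).astype(np.float32)

pairwise   = x.sum()                             # NumPy's pairwise summation
sequential = np.cumsum(x, dtype=np.float32)[-1]  # strict left-to-right order

# Same data, same dtype, different float32 results.
print(pairwise, sequential, abs(float(pairwise) - float(sequential)))
```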
chinakook commented 3 years ago

A Torch test case. Torch shows a smaller difference between the 2080Ti and the 3090. However, MXNet on the RTX 3090 can differ by up to 0.3 in some cases.

    import torch 
    import torchvision as tv
    torch.backends.cudnn.benchmark=True

    model = tv.models.resnet18(pretrained=True)
    model.cuda(0)
    model.eval()

    # y is always 948.1921 on CPU
    # y is always 948.1919 on RTX2080Ti whenever cudnn.benchmark is True or False
    # y is 948.19165 on RTX3090 when cudnn.benchmark=False
    # y varies on RTX3090 when cudnn.benchmark=True: 948.19147, 948.1919
    x = torch.ones(1,3,224,224).cuda(0)
    y = model(x)
    y = y.abs().sum()
    print(y.detach().cpu().numpy())
Neutron3529 commented 3 years ago

> After more tests, I found that the result also varies on RTX2080Ti on both MXNet 1.9.0 and MXNet 2.0.0. ~The result have 0.005 difference in the shallow layer. I think it will have more difference as the layer grows.~

Have you ever tried NVIDIA_TF32_OVERRIDE=0 python? The 3090 uses TF32 to accelerate training and testing by default, and setting NVIDIA_TF32_OVERRIDE=0 disables it.
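Setting the variable from inside Python works too, as long as it happens before the CUDA libraries are loaded. A minimal sketch (the mxnet import is left commented out so the snippet stands alone):

```python
import os

# NVIDIA_TF32_OVERRIDE is read by cuBLAS/cuDNN when they are loaded,
# so it must be in the environment before any CUDA-using framework
# is imported (or be exported in the shell before launching Python).
os.environ['NVIDIA_TF32_OVERRIDE'] = '0'

# import mxnet as mx  # import only after the variable is set

print(os.environ['NVIDIA_TF32_OVERRIDE'])  # '0'
```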

chinakook commented 3 years ago

@Neutron3529 I think it has nothing to do with TF32. I've tested with NVIDIA_TF32_OVERRIDE=0 as you suggested, but the problem is not solved.

Neutron3529 commented 3 years ago

> @Neutron3529 I think it has nothing to do with TF32. I've tested with NVIDIA_TF32_OVERRIDE=0 as you suggested, but the problem is not solved.

My result (v1.x, compiled by myself):

>>> import os
>>> os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
>>> import mxnet as mx
>>> import numpy as np
>>> from mxnet.gluon.model_zoo.vision.resnet import resnet18_v1
>>> def testrestnet(ctx=mx.gpu(0)):
...     mx_model = resnet18_v1(pretrained=True, ctx=ctx)
...     mx_model.hybridize()
...     x_mx = mx.nd.ones(shape=(1,3,224,224), ctx=ctx)
...     y_mx = mx_model.features[0:6](x_mx)
...     res = y_mx.asnumpy().sum()
...     print(res)
... 
>>> testrestnet(mx.cpu())
Downloading /me/mxnet/models/resnet18_v1-a0666292.zipa165046a-afde-4d5a-a034-0163a93f6047 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v1-a0666292.zip...
13064.974
>>> testrestnet(mx.cpu())
13064.974
>>> testrestnet(mx.cpu())
13064.974
>>> testrestnet(mx.gpu())
13064.976
>>> testrestnet(mx.gpu())
13064.976

--- using mxnet without NVIDIA_TF32_OVERRIDE=0

>>> import os
>>> os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
>>> import mxnet as mx
>>> import numpy as np
>>> from mxnet.gluon.model_zoo.vision.resnet import resnet18_v1
>>> def testrestnet(ctx=mx.gpu(0)):
...     mx_model = resnet18_v1(pretrained=True, ctx=ctx)
...     mx_model.hybridize()
...     x_mx = mx.nd.ones(shape=(1,3,224,224), ctx=ctx)
...     y_mx = mx_model.features[0:6](x_mx)
...     res = y_mx.asnumpy().sum()
...     print(res)
... 
>>> testrestnet()
13065.814
>>> testrestnet(mx.cpu())
13064.974

It seems that NVIDIA_TF32_OVERRIDE=0 works for me, and running without it can introduce a large bias.

What's more, my CPU generated a different result from yours (13064.974 vs 13064.977), so maybe the error is normal and not worth an issue.

chinakook commented 3 years ago

@Neutron3529 Yes, v1.x is OK. MXNet 2 has this bug. I'll do further tests.

TristonC commented 3 years ago

TF32 is on by default since MXNet 1.8. PyTorch may have TF32 off by default.
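The size of the gaps reported above is consistent with TF32's 10-bit mantissa: its rounding unit is 2^-11, roughly 4.9e-4, which sits right at the rtol=0.001 boundary. A rough sketch that emulates TF32 by truncating float32 mantissas to 10 bits (real TF32 rounds to nearest and accumulates in FP32, so this only illustrates the order of magnitude):

```python
import numpy as np

def to_tf32(x):
    """Truncate float32 mantissas to TF32's 10 explicit bits.

    Clearing the low 13 of the 23 float32 mantissa bits leaves the
    10-bit TF32 mantissa (truncation, not round-to-nearest).
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

rng = np.random.default_rng(0)
a = rng.uniform(size=(256, 256)).astype(np.float32)
b = rng.uniform(size=(256, 256)).astype(np.float32)

exact = a.astype(np.float64) @ b.astype(np.float64)
tf32  = to_tf32(a).astype(np.float64) @ to_tf32(b).astype(np.float64)

rel_err = np.abs(tf32 - exact).max() / np.abs(exact).max()
print(rel_err)  # on the order of 1e-4 to 1e-3
```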