A shorter piece of code, without gluoncv, that reproduces the issue with mxnet-cu90=1.3.0b20180724:
from mxnet import nd, gluon

image = nd.ones((1, 3, 224, 224))
model_name = 'ResNet18_v1'
net = gluon.model_zoo.vision.get_model(model_name, pretrained=True)
for i in range(10):
    print(net(image)[0][0])
[-1.0796514]
<NDArray 1 @cpu(0)>
[-0.8152852]
<NDArray 1 @cpu(0)>
[-0.8152852]
<NDArray 1 @cpu(0)>
[-0.8152852]
<NDArray 1 @cpu(0)>
[-0.8152852]
<NDArray 1 @cpu(0)>
[-1.6667435]
<NDArray 1 @cpu(0)>
[-1.6667435]
<NDArray 1 @cpu(0)>
[-2.3310144]
<NDArray 1 @cpu(0)>
[-1.2227865]
<NDArray 1 @cpu(0)>
[-1.0537429]
<NDArray 1 @cpu(0)>
Repeatedly running the last command, net(image)[0][0], you may find that the results are not always the same.
On certain machines the output is stable (the issue does not reproduce), while on others the output changes between runs. Not sure what the cause is yet.
The differing results across machines make me think it might be an auto-tuning issue. To rule that out, could you turn off auto-tuning and see if you still get nondeterministic behaviour?
Edit: Don't worry about auto-tuning; since this is CPU-only, it's not related.
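(For reference, had this been GPU-related, cuDNN auto-tuning can be turned off with an environment variable; a minimal sketch, assuming MXNET_CUDNN_AUTOTUNE_DEFAULT is the variable meant above:

import os
# Disable cuDNN convolution auto-tuning; set this before importing mxnet
# so the setting is picked up when the library initializes.
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
import mxnet as mx
)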
@KellenSunderland this is CPU-only.
@szha Ahh, good point. Thanks.
Some more information:
It can easily be reproduced on a p3.16xlarge instance with the Deep Learning Base AMI (Ubuntu) Version 8.0 - ami-c83d62b0. The following findings are based on this configuration:
- mxnet-cu90=1.3.0b20180706 and later versions are buggy, while mxnet-cu90=1.3.0b20180703 and earlier are deterministic (10,000 repeats). @KellenSunderland the environment variable doesn't make a difference.
- With mx.gpu(0), the result is deterministic (10,000 repeats; a sketch of such a check follows this comment).
- mxnet=1.3.0b20180710 and later versions are buggy, while mxnet=1.3.0b20180706 and earlier are deterministic (10,000 repeats).
- It is pretty hard to reproduce on other instances.

@marvis Can you please share the configuration of the environment in which you reproduced this result?
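The 10,000-repeat checks above can be run with a harness along these lines; this is a sketch of my own, as the exact script is not shown in the thread. For the mx.gpu(0) case:

import mxnet as mx
from mxnet import nd, gluon

ctx = mx.gpu(0)
net = gluon.model_zoo.vision.get_model('ResNet18_v1', pretrained=True, ctx=ctx)
image = nd.ones((1, 3, 224, 224), ctx=ctx)
ref = net(image)[0][0].asscalar()
for i in range(10000):
    # Re-run the same forward pass; any deviation means nondeterminism.
    if net(image)[0][0].asscalar() != ref:
        print("mismatch at repeat %d" % i)
        break
else:
    print("deterministic over 10,000 repeats")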
Ubuntu 16.04.3 LTS
Python 2.7.12 :: Anaconda custom (64-bit)
pip install mxnet

>>> import mxnet
>>> mxnet.__version__
'1.2.1'
Thanks,
Does net(image) have some parameter like for_training?
It may be caused by Dropout or BatchNorm layers, which behave differently in the training forward pass.
> Does net(image) have some parameter like for_training?

Not sure.

> It may be caused by Dropout or BatchNorm layers, which behave differently in the training forward pass.

See the network below:
ResNetV1(
  (features): HybridSequential(
    (0): Conv2D(3 -> 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
    (2): Activation(relu)
    (3): MaxPool2D(size=(3, 3), stride=(2, 2), padding=(1, 1), ceil_mode=False)
    (4): HybridSequential(
      (0): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
          (2): Activation(relu)
          (3): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
        )
      )
      (1): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
          (2): Activation(relu)
          (3): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
        )
      )
    )
    (5): HybridSequential(
      (0): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(64 -> 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
          (2): Activation(relu)
          (3): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
        )
        (downsample): HybridSequential(
          (0): Conv2D(64 -> 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
        )
      )
      (1): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
          (2): Activation(relu)
          (3): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
        )
      )
    )
    (6): HybridSequential(
      (0): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(128 -> 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
          (2): Activation(relu)
          (3): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
        )
        (downsample): HybridSequential(
          (0): Conv2D(128 -> 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
        )
      )
      (1): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
          (2): Activation(relu)
          (3): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
        )
      )
    )
    (7): HybridSequential(
      (0): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(256 -> 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
          (2): Activation(relu)
          (3): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
        )
        (downsample): HybridSequential(
          (0): Conv2D(256 -> 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
        )
      )
      (1): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
          (2): Activation(relu)
          (3): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
        )
      )
    )
    (8): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0), ceil_mode=True)
  )
  (output): Dense(512 -> 1000, linear)
)
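Worth noting: this network has no Dropout layer, and in Gluon net(image) takes no for_training flag; the mode comes from autograd scopes, and a plain call runs in predict mode, where BatchNorm uses its stored running statistics. A minimal sketch of the distinction (my own illustration, not from this thread):

import mxnet as mx
from mxnet import nd, gluon

net = gluon.model_zoo.vision.get_model('ResNet18_v1', pretrained=True)
image = nd.ones((1, 3, 224, 224))

# Outside any autograd scope, Gluon runs in predict mode: BatchNorm uses
# its running mean/variance, so no training-time behaviour is involved.
out_predict = net(image)

# Training-mode behaviour must be requested explicitly:
with mx.autograd.train_mode():
    out_train = net(image)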
I have concerns about the BatchNorm layers with the fix_gamma=False parameter. Another issue I am looking at, https://github.com/apache/incubator-mxnet/issues/11774, says the BatchNorm layer cannot ignore beta; there may be something related to the gamma parameter as well. I will experiment and come back with my findings.
It is likely to be related to OpenBLAS. The deterministic versions were built against OpenBLAS 0.2.20. Building against OpenBLAS 0.3.1 reproduces the issue; both 0.3.1 and 0.3.0 reproduce it. OpenBLAS 0.3.1 seems to have a similar non-deterministic bug reported at:
https://github.com/JuliaLang/julia/issues/27960
https://github.com/xianyi/OpenBLAS/issues/1666
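If OpenBLAS is the culprit, the nondeterminism should also show up in a bare matrix multiply with no network involved; a sketch of such a check (my assumption about where the bug would surface):

from mxnet import nd

# Repeated float32 GEMMs on identical operands; if the OpenBLAS sgemm bug
# is responsible, some repeats should differ from the reference result.
a = nd.random.uniform(shape=(512, 512))
b = nd.random.uniform(shape=(512, 512))
ref = nd.dot(a, b).asnumpy()
for i in range(1000):
    if not (nd.dot(a, b).asnumpy() == ref).all():
        print("nondeterministic GEMM at repeat %d" % i)
        break
else:
    print("GEMM deterministic over 1,000 repeats")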
The issue can be reproduced with a simple network containing only convolution and dense layers:
import mxnet as mx
from mxnet import nd, gluon
from mxnet.gluon import nn

image = nd.ones((1, 3, 224, 224))
# net = gluon.model_zoo.vision.get_model(model_name, pretrained=True)
net = nn.Sequential()
net.add(
    nn.Conv2D(channels=5, kernel_size=1, activation='relu'),
    nn.Dense(65),
    nn.Dense(1)
)
net.initialize()
val = net(image)[0][0].asscalar()
print(mx.__path__)
for i in range(10000):
    # Repeat the same forward pass; any mismatch indicates nondeterminism.
    tmp = net(image)[0][0].asscalar()
    if tmp != val:
        print("Error!! : val=%f, tmp=%f, i=%d" % (val, tmp, i))
        break
else:
    print("All Good!! : val=%f, tmp=%f, i=%d" % (val, tmp, i))
Indeed, the mxnet 1.2.1 pip package was released with OpenBLAS 0.3.1. Based on this info, I think we should issue a post-release update and revert OpenBLAS to the previous known stable version, 0.2.20. Will raise this on dev@ and users@.
Summary:
- We found it is possible to reproduce on p3.16xlarge with an Intel(R) Xeon(R) CPU E5-2686 v4 and OpenBLAS 0.3.0 or 0.3.1. With OpenBLAS 0.2.20, we cannot reproduce the issue anymore.
- Under the GPU context (regardless of OpenBLAS version), we cannot reproduce the issue.
Just tested a pip build where OpenBLAS is replaced with 0.2.20; we no longer see the randomness in the forward pass. Therefore we recommend shipping new pip packages for 1.3.0 and 1.2.1 with OpenBLAS at 0.2.20, which should fix the issue.
Hi,
I would like to get all the forward outputs with the following script. However, I find that the results are different every time. The inputs and model params are the same, but I get different outputs. It is very strange. Can you help me figure out this problem?
Best,