apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

gluon forward produces random results #11853

Closed: marvis closed this issue 6 years ago

marvis commented 6 years ago

Hi,

I would like to get all the forward outputs with the following script. However, I find that the results are different every time: the inputs and model params are the same, but the outputs differ. It is very strange. Can you help me figure out this problem?

Best,

from gluoncv import model_zoo
from mxnet import nd, gluon
import mxnet as mx

def forward_mxnet(model_name, image):
    # Load a pretrained model; the prefix keeps parameter names unique.
    net = model_zoo.get_model(model_name, pretrained=True, prefix='%s_' % model_name.replace('.', 'dot'))
    input_symbols = mx.sym.var('data')
    output_symbol = net(input_symbols)
    # get_internals() exposes every intermediate node of the symbolic graph.
    internals = output_symbol.get_internals()
    output_names = internals.list_outputs()
    output_symbols = [internals[name] for name in output_names]
    params = net.collect_params()
    # Rebuild the network as a SymbolBlock that returns all internal outputs.
    model = gluon.SymbolBlock(output_symbols, input_symbols, params=params)
    outputs = model(image)
    output_blobs = dict()
    for name, output in zip(output_names, outputs):
        # Keep only the per-layer outputs (names ending in '_output').
        if '_output' in name:
            output_blobs[name] = output
    return output_blobs, params

#image = nd.random.uniform(shape=(1, 3, 224, 224))
image = nd.ones((1,3,224,224))
model_name = 'ResNet18_v1'
mxnet_blobs, mxnet_params = forward_mxnet(model_name, image)

print('image.mean() = %f' % image.asnumpy().mean())
for key in mxnet_blobs:
    blob = mxnet_blobs[key]
    print('%40s mean %f' % (key, blob.asnumpy().mean()))
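(For context on the script above: get_internals() returns a grouped symbol containing every node of the graph, weights and biases included; the per-layer outputs are the entries whose names end in _output, which is why the script filters on that suffix. A minimal sketch, using a hypothetical one-layer graph:)

import mxnet as mx

# Hypothetical one-layer graph, just to show what get_internals() exposes.
data = mx.sym.var('data')
fc = mx.sym.FullyConnected(data, num_hidden=10, name='fc')
print(fc.get_internals().list_outputs())
# ['data', 'fc_weight', 'fc_bias', 'fc_output']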
hetong007 commented 6 years ago

A shorter piece of code, without gluoncv, that reproduces the issue with mxnet-cu90==1.3.0b20180724:

from mxnet import nd, gluon

image = nd.ones((1,3,224,224))
model_name = 'ResNet18_v1'
net = gluon.model_zoo.vision.get_model(model_name, pretrained=True)
for i in range(10):
    print(net(image)[0][0])
[-1.0796514]
<NDArray 1 @cpu(0)>

[-0.8152852]
<NDArray 1 @cpu(0)>

[-0.8152852]
<NDArray 1 @cpu(0)>

[-0.8152852]
<NDArray 1 @cpu(0)>

[-0.8152852]
<NDArray 1 @cpu(0)>

[-1.6667435]
<NDArray 1 @cpu(0)>

[-1.6667435]
<NDArray 1 @cpu(0)>

[-2.3310144]
<NDArray 1 @cpu(0)>

[-1.2227865]
<NDArray 1 @cpu(0)>

[-1.0537429]
<NDArray 1 @cpu(0)>

If you repeatedly run the last command, net(image)[0][0], you may find that the results are not always the same.

On certain machines the output is stable (not reproducing the issue), while on others the output changes. Not sure what the cause is yet.

KellenSunderland commented 6 years ago

The differing results on different machines make me think it might be an auto-tuning issue. To rule it out, could you turn off auto-tuning and see if you still get nondeterministic behaviour?
Edit: Don't worry about auto-tuning. Since this is CPU-only, it's not related.
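
(For reference, had this been GPU-related: MXNet's cuDNN convolution auto-tuning is controlled by the MXNET_CUDNN_AUTOTUNE_DEFAULT environment variable. A minimal sketch of how one would disable it, assuming the variable is set before the first convolution runs:)

import os
# Disable cuDNN convolution auto-tuning; the variable must be set before
# the first convolution executes for it to take effect.
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

import mxnet as mx  # import after setting the variable, to be safe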

szha commented 6 years ago

@KellenSunderland this is CPU-only.

KellenSunderland commented 6 years ago

@szha Ahh, good point. Thanks.

hetong007 commented 6 years ago

Some more information:

It can be easily reproduced on a p3.16xlarge instance with the Deep Learning Base AMI (Ubuntu) Version 8.0 - ami-c83d62b0. The following findings are based on this configuration.

It is pretty hard to reproduce on other instances. @marvis Can you please share the configuration of the environment in which you reproduced this result?

marvis commented 6 years ago

Ubuntu 16.04.3 LTS
Python 2.7.12 :: Anaconda custom (64-bit)
pip install mxnet

import mxnet
mxnet.__version__   # '1.2.1'

Thanks,

chinakook commented 6 years ago

Does net(image) have a parameter like for_training? It may be caused by Dropout or BatchNorm layers, which behave differently during a training forward pass.
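
(For context: Gluon has no for_training argument; training versus inference behavior is controlled by autograd scopes. A plain net(image) call runs in predict mode, where BatchNorm uses its running statistics and Dropout is a pass-through, so repeated calls should already be deterministic. A minimal sketch, reusing net and image from the snippets above:)

from mxnet import autograd

out_predict = net(image)     # default: predict mode, deterministic layer behavior
with autograd.train_mode():
    out_train = net(image)   # BatchNorm uses batch statistics, Dropout is active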

marvis commented 6 years ago

"Does net(image) have a parameter like for_training?" Not sure. "It may be caused by Dropout or BatchNorm layers, which behave differently during a training forward pass." See the network below:

ResNetV1(
  (features): HybridSequential(
    (0): Conv2D(3 -> 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
    (2): Activation(relu)
    (3): MaxPool2D(size=(3, 3), stride=(2, 2), padding=(1, 1), ceil_mode=False)
    (4): HybridSequential(
      (0): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
          (2): Activation(relu)
          (3): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
        )
      )
      (1): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
          (2): Activation(relu)
          (3): Conv2D(64 -> 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=64)
        )
      )
    )
    (5): HybridSequential(
      (0): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(64 -> 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
          (2): Activation(relu)
          (3): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
        )
        (downsample): HybridSequential(
          (0): Conv2D(64 -> 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
        )
      )
      (1): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
          (2): Activation(relu)
          (3): Conv2D(128 -> 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=128)
        )
      )
    )
    (6): HybridSequential(
      (0): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(128 -> 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
          (2): Activation(relu)
          (3): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
        )
        (downsample): HybridSequential(
          (0): Conv2D(128 -> 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
        )
      )
      (1): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
          (2): Activation(relu)
          (3): Conv2D(256 -> 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=256)
        )
      )
    )
    (7): HybridSequential(
      (0): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(256 -> 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
          (2): Activation(relu)
          (3): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
        )
        (downsample): HybridSequential(
          (0): Conv2D(256 -> 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
        )
      )
      (1): BasicBlockV1(
        (body): HybridSequential(
          (0): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
          (2): Activation(relu)
          (3): Conv2D(512 -> 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm(fix_gamma=False, use_global_stats=False, eps=1e-05, momentum=0.9, axis=1, in_channels=512)
        )
      )
    )
    (8): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0), ceil_mode=True)
  )
  (output): Dense(512 -> 1000, linear)
)
sandeep-krishnamurthy commented 6 years ago

I have concerns about the BatchNorm layer with the fix_gamma=False parameter. Another issue that I am looking at, https://github.com/apache/incubator-mxnet/issues/11774, says the BatchNorm layer cannot ignore "beta"; there may be something related to the gamma parameter as well. I will experiment and come back with my findings.
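
(A quick, hypothetical way to check that hypothesis is to dump the gamma/beta parameters of every BatchNorm layer before and after a forward pass and confirm they do not change; collect_params accepts a regex select:)

# Select only the BatchNorm scale/shift parameters by name and print a summary.
for name, p in net.collect_params('.*gamma|.*beta').items():
    print(name, p.data().asnumpy().mean())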

hetong007 commented 6 years ago

It is likely to be related to OpenBLAS.

OpenBLAS 0.3.1 seems to have a similar non-deterministic bug, reported at:
https://github.com/JuliaLang/julia/issues/27960
https://github.com/xianyi/OpenBLAS/issues/1666

The issue can be reproduced with a simple network containing only convolution and dense layers:

import mxnet as mx
from mxnet import nd, gluon
from mxnet.gluon import nn

image = nd.ones((1,3,224,224))
# net = gluon.model_zoo.vision.get_model(model_name, pretrained=True)

net = nn.Sequential()
net.add(
    nn.Conv2D(channels=5, kernel_size=1, activation='relu'),
    nn.Dense(65),
    nn.Dense(1)
)

net.initialize()

# Record a reference value, then check that repeated forward passes match it.
val = net(image)[0][0].asscalar()

print(mx.__path__)
for i in range(10000):
    tmp = net(image)[0][0].asscalar()
    if tmp != val:
        print("Error!! : val=%f, tmp=%f, i=%d" % (val, tmp, i))
        break
    print("All Good!! : val=%f, tmp=%f, i=%d" % (val, tmp, i))
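
(If OpenBLAS is the culprit, the repro should narrow down even further, since a Dense layer boils down to a single GEMM call into the BLAS library. A sketch in the same spirit as the loop above, exercising only matrix multiplication:)

import mxnet as mx

a = mx.nd.random.uniform(shape=(256, 256))
ref = mx.nd.dot(a, a).asnumpy()
for i in range(10000):
    out = mx.nd.dot(a, a).asnumpy()
    if (out != ref).any():
        print("Nondeterministic GEMM at iteration %d" % i)
        break
else:
    print("GEMM deterministic over 10000 runs")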
szha commented 6 years ago

Indeed, the mxnet 1.2.1 pip package was released with OpenBLAS 0.3.1. Based on this info, I think we should issue a post-release update and revert OpenBLAS to the previous known-stable version, 0.2.20. Will raise this on dev@ and users@.

hetong007 commented 6 years ago

Summary:

We found it is possible to reproduce the issue on p3.16xlarge with an Intel(R) Xeon(R) CPU E5-2686 v4 and OpenBLAS 0.3.0 or 0.3.1, while with OpenBLAS 0.2.20 we cannot reproduce it anymore.

Under a GPU context (regardless of OpenBLAS version), we cannot reproduce the issue.

We just tested a pip build where OpenBLAS is replaced with 0.2.20 and can no longer see the randomness in the forward pass. Therefore we recommend shipping new pip packages for 1.3.0 and 1.2.1 with OpenBLAS at 0.2.20, which should fix the issue.
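
(Once a rebuilt wheel is installed, a compact determinism check distilled from the repro above:)

import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn

print(mx.__version__)

net = nn.Sequential()
net.add(nn.Conv2D(channels=5, kernel_size=1, activation='relu'), nn.Dense(1))
net.initialize()
image = nd.ones((1, 3, 224, 224))

ref = net(image)[0][0].asscalar()
# With OpenBLAS 0.2.20 every repetition should reproduce ref exactly.
assert all(net(image)[0][0].asscalar() == ref for _ in range(1000))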