apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

test_tensorrt_resnet18.test_tensorrt_resnet18_feature_vect numerical error on T4 gpu #15229

Open haohuanw opened 5 years ago

haohuanw commented 5 years ago

Description

test_tensorrt_resnet18.test_tensorrt_resnet18_feature_vect passes on a V100 GPU but fails with a numerical error on a T4 GPU.

Environment info (Required)

Sorry, I have to hide the CPU information because the machine I am using is under an NDA policy.

root@b0a2b9b22fac:/work/mxnet# python3 diagnose.py 
----------Python Info----------
Version      : 3.6.7
Compiler     : GCC 8.2.0
Build        : ('default', 'Oct 22 2018 11:32:17')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.1.1
Directory    : /usr/local/lib/python3.6/dist-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /usr/local/lib/python3.6/dist-packages/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.14.106-79.86.xxxxxxx-x86_64-with-Ubuntu-18.04-bionic
system       : Linux
node         : b0a2b9b22fac
release      : 4.14.106-79.86.xxxxxxx.x86_64
version      : #1 SMP Tue Mar 19 00:48:07 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Vendor ID:           GenuineIntel
Model name:          Intel(R) Xeon(R) xxxxx CPU @ xxxxGHz
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0016 sec, LOAD: 0.4355 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1399 sec, LOAD: 0.1254 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2835 sec, LOAD: 0.5443 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0170 sec, LOAD: 0.1960 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0025 sec, LOAD: 0.0719 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0038 sec, LOAD: 0.0386 sec.

Package used (Python/R/Scala/Julia): I'm using the Python API.

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash: 5fc4fc53df74f276aafa51208142e657e9cfe42d

Build config: built with ./ci/build.py -p ubuntu_gpu_tensorrt

Error Message:

[00:56:21] /work/mxnet/src/operator/subgraph/build_subgraph.cc:686: start to execute partition graph.
[00:56:21] /work/mxnet/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
F
======================================================================
FAIL: trt.test_tensorrt_resnet18_feature_vect
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/trt.py", line 67, in test_tensorrt_resnet18_feature_vect
    assert_almost_equal(no_trt_output, trt_output, 1e-1, 1e-2)
  File "/usr/local/lib/python3.6/dist-packages/mxnet/test_utils.py", line 503, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 1.155339 exceeds tolerance rtol=0.100000, atol=0.010000.  Location of maximum error:(939,), a=0.006451, b=0.020357
 a: array([ 0.36684752,  2.8859496 , -0.5449833 , ..., -0.77340114,
        2.9310114 ,  1.8106201 ], dtype=float32)
 b: array([ 0.3561221 ,  2.8805661 , -0.55245507, ..., -0.77211773,
        2.9371974 ,  1.8207407 ], dtype=float32)
-------------------- >> begin captured stdout << ---------------------
downloading sample input
Downloading /root/.mxnet/models/resnet18_v2-a81db45f.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v2-a81db45f.zip...

--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): github.com
urllib3.connectionpool: DEBUG: https://github.com:443 "GET /dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg?raw=true HTTP/1.1" 302 None
urllib3.connectionpool: DEBUG: https://github.com:443 "GET /dmlc/web-data/raw/master/mxnet/doc/tutorials/python/predict_image/cat.jpg HTTP/1.1" 302 169
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): raw.githubusercontent.com
urllib3.connectionpool: DEBUG: https://raw.githubusercontent.com:443 "GET /dmlc/web-data/master/mxnet/doc/tutorials/python/predict_image/cat.jpg HTTP/1.1" 200 227791
root: INFO: downloaded https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg?raw=true into cat.jpg successfully
root: INFO: Model file not found. Downloading to /root/.mxnet/models/resnet18_v2-a81db45f.params.
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): apache-mxnet.s3-accelerate.dualstack.amazonaws.com
urllib3.connectionpool: DEBUG: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com:443 "GET /gluon/models/resnet18_v2-a81db45f.zip HTTP/1.1" 200 43433557
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 69.297s

FAILED (failures=1)
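
For reference, the reported error of 1.155339 can be roughly reproduced from the numbers in the assertion message. This is a minimal sketch that assumes the error is the ratio |a - b| / (atol + rtol * |b|), with anything above 1 counted as a failure; the `violation` helper is a hypothetical name for illustration, not a function from `mxnet.test_utils`.

```python
# Values copied from the assertion message above (location 939).
a = 0.006451            # no_trt_output value
b = 0.020357            # trt_output value
rtol, atol = 1e-1, 1e-2  # tolerances passed to assert_almost_equal


def violation(a, b, rtol, atol):
    """Assumed error metric: |a - b| / (atol + rtol * |b|); > 1 means failure."""
    return abs(a - b) / (atol + rtol * abs(b))


err = violation(a, b, rtol, atol)
print(f"error ratio: {err:.3f}")  # close to the reported 1.155339
```

The small difference from the logged figure comes from `a` and `b` being printed with only six decimal places. Under this assumption the absolute difference is about 0.0139 against an allowed 0.0120, so the element fails by a narrow margin, and both values are close to zero, where the absolute tolerance dominates.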

Minimum reproducible example

https://github.com/apache/incubator-mxnet/blob/master/tests/python/tensorrt/test_resnet18.py

Steps to reproduce

  1. Pull the Docker image from https://cloud.docker.com/u/haohuanw/repository/docker/haohuanw/trtdebug-turing
  2. Run python3.6 tests/python/tensorrt/test_resnet18.py

What have you tried to solve it?

This seems to happen only on particular hardware (it passes on V100 but fails on T4), so there is nothing I can really do.

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Test

aaronmarkham commented 5 years ago

This also failed here, only on CI, on a PR that should be unrelated: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15885/14/pipeline/303

haojin2 commented 5 years ago

Also here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16827/5/pipeline. This seems like a flaky test with a borderline tolerance value. I'll submit a PR to bump up the tolerance by a little bit.
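
To put a number on "borderline", here is a rough sketch of how much absolute tolerance the single worst element from the original report would need. It assumes the pass criterion is |a - b| <= atol + rtol * |b| and uses only the one (a, b) pair printed in that report; `min_passing_atol` is a hypothetical helper, and other elements or other runs may need a different margin.

```python
def min_passing_atol(a, b, rtol):
    """Smallest atol satisfying |a - b| <= atol + rtol * |b| (assumed criterion)."""
    return max(0.0, abs(a - b) - rtol * abs(b))


# Worst element from the original report: a=0.006451, b=0.020357, rtol=0.1.
needed = min_passing_atol(0.006451, 0.020357, rtol=1e-1)
print(f"atol needed for this element: {needed:.4f}")  # current atol is 0.01
```

For this element the required atol comes out just under 0.012, compared with the current 0.01, which is consistent with a small bump being enough for that particular failure.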

TaoLv commented 4 years ago

Still failing: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17021/1/pipeline

ChaiBapchya commented 4 years ago

~~Another one : http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18119/3/pipeline unrelated PR : #18119~~ The error for that pipeline was a 403, not a numerical error.

ChaiBapchya commented 4 years ago

@haohuanw our CI as of 4/21 doesn't really use g4 instances. The T4 GPU [Tesla] is used in G4 instances, while the p3 & g3 instances [currently used for GPU workloads in CI] use the Tesla V100 and M60 respectively. Since our CI has been failing this test since June 2019, it looks like an issue related to V and not T4.

Correct me if I'm wrong @leezu @josephevans