Open haohuanw opened 5 years ago
Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Test
Failed here, only on CI, in what should be an unrelated PR: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15885/14/pipeline/303
Also here: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16827/5/pipeline This seems like a flaky test with a borderline tolerance value. I'll submit a PR to bump up the tolerance a little bit.
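To illustrate what a "borderline tolerance" means here, the following is a minimal sketch, not the actual test code. The check mirrors the usual `|actual - expected| <= atol + rtol * |expected|` rule; the helper names, the sample values, the simulated 0.15% deviation, and both tolerance values are assumptions made up for illustration.

```python
def within_tolerance(actual, expected, rtol, atol=0.0):
    """True if every element satisfies |a - e| <= atol + rtol * |e|."""
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


def max_relative_error(actual, expected):
    """Largest element-wise relative difference (expected must be non-zero)."""
    return max(abs(a - e) / abs(e) for a, e in zip(actual, expected))


# Hypothetical reference feature vector (e.g. from the non-TensorRT run).
reference = [0.5, 1.0, 2.0, 4.0]
# Hypothetical output from another GPU/backend, off by 0.15% per element.
candidate = [x * (1.0 + 1.5e-3) for x in reference]

err = max_relative_error(candidate, reference)
passes_tight = within_tolerance(candidate, reference, rtol=1e-3)  # assumed old value
passes_loose = within_tolerance(candidate, reference, rtol=2e-3)  # assumed bumped value

print("max relative error:", err)
print("passes with rtol=1e-3:", passes_tight)
print("passes with rtol=2e-3:", passes_loose)
```

If the real error sits close to the threshold, small run-to-run or hardware-to-hardware differences can tip the check either way, which is what makes the test flaky. Logging the maximum relative error on each GPU before changing the tolerance would show how much headroom a bump actually buys.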
~~Another one : http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18119/3/pipeline unrelated PR : #18119~~ The error for that pipeline was 403 and not numerical error.
@haohuanw our CI as of 4/21 doesn't really use G4 instances. The Tesla T4 GPU is used in G4 instances, while the p3 and g3 instances [currently used for GPU workloads in CI] use the Tesla V100 and M60 respectively. Since our CI has been failing on this test since June 2019, it looks like the issue is related to the V100 and not the T4.
Correct me if I'm wrong @leezu @josephevans
Description
test_tensorrt_resnet18.test_tensorrt_resnet18_feature_vect succeeded on a V100 GPU but hit a numerical issue on a T4 GPU.
Environment info (Required)
Sorry I have to hide the CPU information since I am using a machine under NDA policy.
Package used (Python/R/Scala/Julia): I'm using the Python API.
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash: 5fc4fc53df74f276aafa51208142e657e9cfe42d
Build config: built with ./ci/build.py -p ubuntu_gpu_tensorrt
Error Message:
Minimum reproducible example
https://github.com/apache/incubator-mxnet/blob/master/tests/python/tensorrt/test_resnet18.py
Steps to reproduce
What have you tried to solve it?
This seems to happen only on particular hardware (it passed on V100 but failed on T4), so there is nothing I can really do.