ROCm / hipCaffe

(Deprecated) hipCaffe: the HIP port of Caffe

test_all.testbin failed on ThresholdLayerTest & RNNLayerTest #19

Open · dhzhd1 opened 6 years ago

dhzhd1 commented 6 years ago

Issue summary

After running ./build/test/test_all.testbin, the test items below failed:

1) ThresholdLayerTest/3.Test
Error Message:
src/caffe/test/test_threshold_layer.cpp:67: Failure
Expected: (bottom_data[i]) > (threshold_), actual: -0.635736 vs 0
src/caffe/test/test_threshold_layer.cpp:67: Failure
Expected: (bottom_data[i]) > (threshold_), actual: -0.363372 vs 0
src/caffe/test/test_threshold_layer.cpp:64: Failure
......

2) RNNLayerTest/2.TestForward
Error Message:
src/caffe/test/test_rnn_layer.cpp:156: Failure
Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: -0 vs -0
t = 1; i = 0
src/caffe/test/test_rnn_layer.cpp:156: Failure
Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: -0 vs -0
t = 1; i = 1
src/caffe/test/test_rnn_layer.cpp:156: Failure
Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: 0 vs 0
t = 1; i = 2
......

3) RNNLayerTest/2.TestGradient (see the gradient-checker sketch after this list)
Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.37764036655426025, which exceeds threshold_ * scale, where
computed_gradient evaluates to -0.37764036655426025,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 0.24414941668510437; objective+ = -0; objective- = -0
... ...

4) RNNLayerTest/2.TestGradientNonZeroCont
Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.37112760543823242, which exceeds threshold_ * scale, where
computed_gradient evaluates to 0.37112760543823242,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18513253331184387; objective+ = -0; objective- = -0
... ...

5) RNNLayerTest/2.TestGradientNonZeroContBufferSize2
Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.14533787965774536, which exceeds threshold_ * scale, where
computed_gradient evaluates to -0.14533787965774536,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.17294931411743164; objective+ = 0; objective- = 0
... ...

6) RNNLayerTest/2.TestGradientNonZeroContBufferSize2WithStaticInput
Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.14564625918865204, which exceeds threshold_ * scale, where
computed_gradient evaluates to 0.14564625918865204,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 0.20095519721508026; objective+ = -0; objective- = -0
... ...

7) RNNLayerTest/3.TestForward
Error Message:
MIOpen Error: /data/repo/MIOpen/src/ocl/activ_ocl.cpp:45: Only alpha=1 and beta=0 is supported
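Failures 3) through 6) all come from Caffe's numerical gradient checker: it nudges each input feature up and down by a small step, estimates the gradient from the two resulting objective values, and compares that estimate against the gradient computed by the backward pass. Below is a minimal sketch of that comparison, loosely modeled on the check at test_gradient_check_util.hpp:175; the `objective` callback and all concrete numbers are illustrative stand-ins, not the checker's actual loss evaluation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>

// Sketch of a central-difference gradient check. The real checker walks
// every blob and feature index; `objective` stands in for Caffe's
// forward-pass loss evaluation.
bool CheckGradientAt(const std::function<double(double)>& objective,
                     double feat, double computed_gradient,
                     double stepsize = 1e-2, double threshold = 1e-3) {
  // Central finite difference: (objective+ - objective-) / (2 * stepsize).
  const double obj_plus = objective(feat + stepsize);
  const double obj_minus = objective(feat - stepsize);
  const double estimated_gradient = (obj_plus - obj_minus) / (2.0 * stepsize);
  // The tolerance is scaled by the larger gradient magnitude (at least 1);
  // this is the "threshold_ * scale" quantity in the failure messages.
  const double scale = std::max(
      std::max(std::fabs(computed_gradient), std::fabs(estimated_gradient)),
      1.0);
  return std::fabs(computed_gradient - estimated_gradient) <=
         threshold * scale;
}

int main() {
  // Mirrors the reported failures: the backward pass returned a nonzero
  // gradient while objective+ and objective- both evaluated to -0, so the
  // numerical estimate is 0 and the check fails.
  const auto dead_objective = [](double) { return -0.0; };
  const bool ok = CheckGradientAt(dead_objective, 0.2441, -0.3776);
  std::printf("gradient check %s\n", ok ? "passed" : "failed");
  return 0;
}
```

Notably, in all four RNN failures the estimated gradient is exactly 0 with objective+ = objective- = -0, which suggests a forward pass producing no output at all rather than a merely inaccurate gradient.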

Steps to reproduce

Built test_all.testbin according to README.ROCm.md. All of the prerequisite packages have been installed, and LD_LIBRARY_PATH and PATH have been set up.

Your system configuration

GPU: AMD MI25
Operating system: Ubuntu 16.04.3 64-bit
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): Python 2.7.12
Other: miopen-hip 1.1.4, miopengemm 1.1.5, rocm-libs 1.6.180

parallelo commented 6 years ago

Thanks for the report, @dhzhd1. We'll take a look.

yige-hu commented 6 years ago

Hi Jeff @parallelo,

I'm still observing the 6th failure. My configuration:

Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0
GPU: AMD RX 580
Backend: ROCm

./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.40885764360427856, which exceeds threshold_ * scale, where
computed_gradient evaluates to 0.40885764360427856,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,27,6,47; feat = 0.20945753157138824; objective+ = -0; objective- = -0
[  FAILED  ] RNNLayerTest/2.TestGradientNonZeroContBufferSize2WithStaticInput, where TypeParam = caffe::GPUDevice<float> (20358 ms)
....
MIOpen Error: /data/repo/MIOpen/src/ocl/activ_ocl.cpp:47: Only alpha=1 and beta=0 is supported
F0225 20:25:48.964237  9249 cudnn_tanh_layer_hip.cpp:23] Check failed: status == miopenStatusSuccess (7 vs. 0)  miopenStatusUnknownError
*** Check failure stack trace: ***
    @     0x7f2b521295cd  google::LogMessage::Fail()
    @     0x7f2b5212b433  google::LogMessage::SendToLog()
    @     0x7f2b5212915b  google::LogMessage::Flush()
    @     0x7f2b5212be1e  google::LogMessageFatal::~LogMessageFatal()
    @          0x1547cce  caffe::CuDNNTanHLayer<>::Forward_gpu()
    @           0x4f7967  caffe::Layer<>::Forward()
    @          0x1b3a137  caffe::Net<>::ForwardFromTo()
    @          0x1c3ab1a  caffe::RecurrentLayer<>::Forward_gpu()
    @           0x4f7967  caffe::Layer<>::Forward()
    @           0x5b73f2  caffe::RNNLayerTest_TestForward_Test<>::TestBody()
    @          0x108fd14  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x108fbd6  testing::Test::Run()
    @          0x1090d21  testing::TestInfo::Run()
    @          0x1091577  testing::TestCase::Run()
    @          0x1097c57  testing::internal::UnitTestImpl::RunAllTests()
    @          0x1097694  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x1097649  testing::UnitTest::Run()
    @          0x2006fda  main
    @     0x7f2b4d5e4830  __libc_start_main
    @          0x2006479  _start
    @              (nil)  (unknown)
Aborted (core dumped)
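The abort at the end of this log is the RNNLayerTest/3.TestForward failure from the original report: hipCaffe's TanH layer calls into MIOpen's activation forward, and at this point MIOpen's OpenCL activation path (activ_ocl.cpp) accepted only the scaling factors alpha = 1, beta = 0. Here is a hedged sketch of the call shape involved, assuming the cuDNN-style miopenActivationForward signature from miopen/miopen.h; the function and parameter names are illustrative, not the exact hipCaffe code.

```cpp
#include <miopen/miopen.h>
#include <glog/logging.h>

// MIOpen computes y = alpha * act(x) + beta * y. At the time of this
// issue its OpenCL activation kernel accepted only alpha = 1, beta = 0;
// any other pair produces the "Only alpha=1 and beta=0 is supported"
// error, which hipCaffe turns into the fatal CHECK failure at
// cudnn_tanh_layer_hip.cpp:23 seen in the log above.
miopenStatus_t ActivationForwardSketch(
    miopenHandle_t handle, miopenActivationDescriptor_t activ_desc,
    float alpha, miopenTensorDescriptor_t x_desc, const void* x,
    float beta, miopenTensorDescriptor_t y_desc, void* y) {
  miopenStatus_t status = miopenActivationForward(
      handle, activ_desc, &alpha, x_desc, x, &beta, y_desc, y);
  // Mirrors the hipCaffe layer: abort on anything but success.
  CHECK(status == miopenStatusSuccess)
      << "miopen status " << status << " (only alpha = 1, beta = 0 works)";
  return status;
}
```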

Thanks, Yige

davclark commented 6 years ago

I'm also getting a number of failures, which seem to be in the same ballpark. I'm building from a just-updated checkout of the rocrand branch; I followed the instructions exactly, with no changes to Makefile.config. Basically the same config as @yige-hu:

Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0
GPU: AMD RX 580, drivers from the ROCm PPA
CPU: Threadripper 1900X on an X399 chipset
Backend: ROCm

One thing I'm seeing is the "0 ms" note on most of the failures. I'm guessing the operation simply isn't running.

Anyway, please let me know if this is a good place to post or if there's a better place!

[ RUN      ] EmbedLayerTest/0.TestGradientWithBias
src/caffe/test/test_embed_layer.cpp:183: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/0.TestGradientWithBias, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] EmbedLayerTest/1.TestGradient
src/caffe/test/test_embed_layer.cpp:158: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/1.TestGradient, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] EmbedLayerTest/2.TestGradient
src/caffe/test/test_embed_layer.cpp:158: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] EmbedLayerTest/3.TestGradientWithBias
src/caffe/test/test_embed_layer.cpp:183: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/3.TestGradientWithBias, where TypeParam = caffe::GPUDevice<double> (0 ms)

[ RUN      ] MaxPoolingDropoutTest/2.TestBackward
src/caffe/test/test_maxpool_dropout_layers.cpp:124: Failure
Expected: (sum_with_dropout) >= (sum), actual: 22 vs 36
[  FAILED  ] MaxPoolingDropoutTest/2.TestBackward, where TypeParam = caffe::GPUDevice<float> (2 ms)

[ RUN      ] ConvolutionLayerTest/0.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestSimple3DConvolution, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/0.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestDilated3DConvolution, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/0.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestNDAgainst2D, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/0.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestGradient3D, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestSimple3DConvolution, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestDilated3DConvolution, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestNDAgainst2D, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestGradient3D, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestSimple3DConvolution, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestDilated3DConvolution, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestNDAgainst2D, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestGradient3D, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/3.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/3.TestSimple3DConvolution, where TypeParam = caffe::GPUDevice<double> (1 ms)

[ RUN      ] ConvolutionLayerTest/3.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/3.TestDilated3DConvolution, where TypeParam = caffe::GPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/3.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/3.TestNDAgainst2D, where TypeParam = caffe::GPUDevice<double> (0 ms)

[ RUN      ] NeuronLayerTest/2.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.054580926895141602, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1,
estimated_gradient evaluates to 0.9454190731048584, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.10926593095064163; objective+ = 1.289490818977356; objective- = 1.2705824375152588
<snipped a bunch more like that...>
[  FAILED  ] NeuronLayerTest/2.TestBNLLGradient, where TypeParam = caffe::GPUDevice<float> (67 ms)

[ RUN      ] NeuronLayerTest/3.TestDropoutHalf
src/caffe/test/test_neuron_layer.cpp:87: Failure
The difference between empirical_dropout_ratio and dropout_ratio is 0.5, which exceeds 1.96 * std_error, where
empirical_dropout_ratio evaluates to 1,
dropout_ratio evaluates to 0.5, and
1.96 * std_error evaluates to 0.089461353392063625.
[  FAILED  ] NeuronLayerTest/3.TestDropoutHalf, where TypeParam = caffe::GPUDevice<double> (1 ms)

[ RUN      ] NeuronLayerTest/3.TestDropoutThreeQuarters
src/caffe/test/test_neuron_layer.cpp:87: Failure
The difference between empirical_dropout_ratio and dropout_ratio is 0.25, which exceeds 1.96 * std_error, where
empirical_dropout_ratio evaluates to 1,
dropout_ratio evaluates to 0.75, and
1.96 * std_error evaluates to 0.077475803251365008.
[  FAILED  ] NeuronLayerTest/3.TestDropoutThreeQuarters, where TypeParam = caffe::GPUDevice<double> (1 ms)

[ RUN      ] NeuronLayerTest/3.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.091322861619979268, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1,
estimated_gradient evaluates to 0.90867713838002073, and
threshold_ * scale evaluates to 0.001.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18315754813835755; objective+ = 1.220623351064388; objective- = 1.2024498082967876
<again snipping many repeats...>
[  FAILED  ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double> (63 ms)

[ RUN      ] NetTest/0.TestReshape
Segmentation fault (core dumped)

You can see that there's a core dump there at the end!
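For context on the two dropout failures in that log: the test is statistical. It runs a TRAIN-phase forward pass, measures the fraction of outputs that came out zero, and requires that empirical ratio to land within roughly a 95% confidence interval of the configured dropout ratio. A minimal sketch of that check, assuming a binomial standard error over n independent outputs (n = 120 reproduces the 1.96 * std_error values reported in the log):

```cpp
#include <cmath>
#include <cstdio>

// Sketch of a statistical dropout check: treat each of n outputs as an
// independent Bernoulli(dropout_ratio) drop and accept the empirical
// ratio if it falls within 1.96 binomial standard errors of the target.
bool CheckDropoutRatio(int num_zeroed, int n, double dropout_ratio) {
  const double empirical_dropout_ratio =
      static_cast<double>(num_zeroed) / n;
  const double std_error =
      std::sqrt(dropout_ratio * (1.0 - dropout_ratio) / n);
  return std::fabs(empirical_dropout_ratio - dropout_ratio) <=
         1.96 * std_error;
}

int main() {
  const int n = 120;
  // The failing runs reported empirical_dropout_ratio = 1: every output
  // was zero, far outside the interval for a target ratio of 0.5.
  std::printf("all-zero output passes: %d\n",
              CheckDropoutRatio(n, n, 0.5));
  // A run near the configured ratio passes comfortably.
  std::printf("near-target output passes: %d\n",
              CheckDropoutRatio(n / 2, n, 0.5));
  return 0;
}
```

An empirical ratio of exactly 1 in both failures again points at the GPU path producing an all-zero output, rather than a borderline statistical miss.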

davclark commented 6 years ago

Just updated to the hip branch, which doesn't seem to have many meaningful changes over the rocrand branch. The same errors persist. I can also report that MNIST and CaffeNet fail, both with core dumps.

parallelo commented 6 years ago

@davclark - Thanks for the heads-up. Please open a new ticket for the core dumps, as that appears to be a separate issue.