Open dhzhd1 opened 6 years ago
Thanks for the report, @dhzhd1. We'll take a look.
Hi Jeff @parallelo,
I'm still observing the 6th failure. My configuration:
Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0
GPU: AMD RX 580
ROCm backend.
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.40885764360427856, which exceeds threshold_ * scale, where
computed_gradient evaluates to 0.40885764360427856,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,27,6,47; feat = 0.20945753157138824; objective+ = -0; objective- = -0
[ FAILED ] RNNLayerTest/2.TestGradientNonZeroContBufferSize2WithStaticInput, where TypeParam = caffe::GPUDevice<float> (20358 ms)
....
MIOpen Error: /data/repo/MIOpen/src/ocl/activ_ocl.cpp:47: Only alpha=1 and beta=0 is supported
F0225 20:25:48.964237 9249 cudnn_tanh_layer_hip.cpp:23] Check failed: status == miopenStatusSuccess (7 vs. 0) miopenStatusUnknownError
*** Check failure stack trace: ***
@ 0x7f2b521295cd google::LogMessage::Fail()
@ 0x7f2b5212b433 google::LogMessage::SendToLog()
@ 0x7f2b5212915b google::LogMessage::Flush()
@ 0x7f2b5212be1e google::LogMessageFatal::~LogMessageFatal()
@ 0x1547cce caffe::CuDNNTanHLayer<>::Forward_gpu()
@ 0x4f7967 caffe::Layer<>::Forward()
@ 0x1b3a137 caffe::Net<>::ForwardFromTo()
@ 0x1c3ab1a caffe::RecurrentLayer<>::Forward_gpu()
@ 0x4f7967 caffe::Layer<>::Forward()
@ 0x5b73f2 caffe::RNNLayerTest_TestForward_Test<>::TestBody()
@ 0x108fd14 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x108fbd6 testing::Test::Run()
@ 0x1090d21 testing::TestInfo::Run()
@ 0x1091577 testing::TestCase::Run()
@ 0x1097c57 testing::internal::UnitTestImpl::RunAllTests()
@ 0x1097694 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x1097649 testing::UnitTest::Run()
@ 0x2006fda main
@ 0x7f2b4d5e4830 __libc_start_main
@ 0x2006479 _start
@ (nil) (unknown)
Aborted (core dumped)
Thanks, Yige
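The MIOpen message above refers to the alpha and beta scaling factors that cuDNN-style activation calls accept, where the result is conventionally blended as y = alpha * f(x) + beta * y_prior. As a rough illustration only, the sketch below mimics that convention and the restriction quoted in the log. `activation_forward` and its `strict` flag are hypothetical names, not MIOpen or hipCaffe APIs, and the alpha/beta values the failing layer actually passed are not shown in this thread.

```python
import math


def activation_forward(x, y_prior, alpha=1.0, beta=0.0, strict=True):
    """Hypothetical helper: blend tanh(x) into y_prior, cuDNN-style.

    Computes y = alpha * tanh(x) + beta * y_prior element by element.
    With strict=True it rejects anything except alpha=1, beta=0,
    mirroring the restriction reported in the log message.
    """
    if strict and (alpha != 1.0 or beta != 0.0):
        raise ValueError("Only alpha=1 and beta=0 is supported")
    return [alpha * math.tanh(xi) + beta * yi for xi, yi in zip(x, y_prior)]


# Supported case: the prior contents of the output buffer are ignored.
print(activation_forward([0.0, 1.0], [5.0, 5.0]))

# Unsupported case: any other scaling is refused when strict=True.
try:
    activation_forward([0.0], [5.0], alpha=1.0, beta=1.0)
except ValueError as err:
    print("rejected:", err)
```

If a caller passes anything other than alpha=1 and beta=0 to a backend with this restriction, the call fails outright rather than producing a wrong result, which would be consistent with the fatal check in cudnn_tanh_layer_hip.cpp.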
I'm also getting a number of failures, which seem to be in the same ballpark. I'm building on a just-updated checkout of the rocrand branch and followed the instructions exactly, with no changes to Makefile.config. Basically the same config as @yige-hu:
Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0
GPU: AMD RX 580, drivers from rocm PPA
CPU: Threadripper 1900X on X399 chipset
ROCm backend.
One thing I'm seeing is this "0 ms" note on most of the failures. I'm guessing the operation is simply not running.
Anyway, please let me know if this is a good place to post or if there's a better place!
[ RUN ] EmbedLayerTest/0.TestGradientWithBias
src/caffe/test/test_embed_layer.cpp:183: Failure
Value of: 1
Expected: 0
[ FAILED ] EmbedLayerTest/0.TestGradientWithBias, where TypeParam = caffe::CPUDevice<float> (0 ms)
[ RUN ] EmbedLayerTest/1.TestGradient
src/caffe/test/test_embed_layer.cpp:158: Failure
Value of: 1
Expected: 0
[ FAILED ] EmbedLayerTest/1.TestGradient, where TypeParam = caffe::CPUDevice<double> (0 ms)
[ RUN ] EmbedLayerTest/2.TestGradient
src/caffe/test/test_embed_layer.cpp:158: Failure
Value of: 1
Expected: 0
[ FAILED ] EmbedLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float> (0 ms)
[ RUN ] EmbedLayerTest/3.TestGradientWithBias
src/caffe/test/test_embed_layer.cpp:183: Failure
Value of: 1
Expected: 0
[ FAILED ] EmbedLayerTest/3.TestGradientWithBias, where TypeParam = caffe::GPUDevice<double> (0 ms)
[ RUN ] MaxPoolingDropoutTest/2.TestBackward
src/caffe/test/test_maxpool_dropout_layers.cpp:124: Failure
Expected: (sum_with_dropout) >= (sum), actual: 22 vs 36
[ FAILED ] MaxPoolingDropoutTest/2.TestBackward, where TypeParam = caffe::GPUDevice<float> (2 ms)
[ RUN ] ConvolutionLayerTest/0.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/0.TestSimple3DConvolution, where TypeParam = caffe::CPUDevice<float> (0 ms)
[ RUN ] ConvolutionLayerTest/0.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/0.TestDilated3DConvolution, where TypeParam = caffe::CPUDevice<float> (0 ms)
[ RUN ] ConvolutionLayerTest/0.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/0.TestNDAgainst2D, where TypeParam = caffe::CPUDevice<float> (0 ms)
[ RUN ] ConvolutionLayerTest/0.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/0.TestGradient3D, where TypeParam = caffe::CPUDevice<float> (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/1.TestSimple3DConvolution, where TypeParam = caffe::CPUDevice<double> (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/1.TestDilated3DConvolution, where TypeParam = caffe::CPUDevice<double> (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/1.TestNDAgainst2D, where TypeParam = caffe::CPUDevice<double> (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/1.TestGradient3D, where TypeParam = caffe::CPUDevice<double> (0 ms)
[ RUN ] ConvolutionLayerTest/2.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/2.TestSimple3DConvolution, where TypeParam = caffe::GPUDevice<float> (0 ms)
[ RUN ] ConvolutionLayerTest/2.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/2.TestDilated3DConvolution, where TypeParam = caffe::GPUDevice<float> (0 ms)
[ RUN ] ConvolutionLayerTest/2.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/2.TestNDAgainst2D, where TypeParam = caffe::GPUDevice<float> (0 ms)
[ RUN ] ConvolutionLayerTest/2.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/2.TestGradient3D, where TypeParam = caffe::GPUDevice<float> (0 ms)
[ RUN ] ConvolutionLayerTest/3.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/3.TestSimple3DConvolution, where TypeParam = caffe::GPUDevice<double> (1 ms)
[ RUN ] ConvolutionLayerTest/3.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/3.TestDilated3DConvolution, where TypeParam = caffe::GPUDevice<double> (0 ms)
[ RUN ] ConvolutionLayerTest/3.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[ FAILED ] ConvolutionLayerTest/3.TestNDAgainst2D, where TypeParam = caffe::GPUDevice<double> (0 ms)
[ RUN ] NeuronLayerTest/2.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.054580926895141602, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1,
estimated_gradient evaluates to 0.9454190731048584, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.10926593095064163; objective+ = 1.289490818977356; objective- = 1.2705824375152588
<snipped a bunch more like that...>
[ FAILED ] NeuronLayerTest/2.TestBNLLGradient, where TypeParam = caffe::GPUDevice<float> (67 ms)
[ RUN ] NeuronLayerTest/3.TestDropoutHalf
src/caffe/test/test_neuron_layer.cpp:87: Failure
The difference between empirical_dropout_ratio and dropout_ratio is 0.5, which exceeds 1.96 * std_error, where
empirical_dropout_ratio evaluates to 1,
dropout_ratio evaluates to 0.5, and
1.96 * std_error evaluates to 0.089461353392063625.
[ FAILED ] NeuronLayerTest/3.TestDropoutHalf, where TypeParam = caffe::GPUDevice<double> (1 ms)
[ RUN ] NeuronLayerTest/3.TestDropoutThreeQuarters
src/caffe/test/test_neuron_layer.cpp:87: Failure
The difference between empirical_dropout_ratio and dropout_ratio is 0.25, which exceeds 1.96 * std_error, where
empirical_dropout_ratio evaluates to 1,
dropout_ratio evaluates to 0.75, and
1.96 * std_error evaluates to 0.077475803251365008.
[ FAILED ] NeuronLayerTest/3.TestDropoutThreeQuarters, where TypeParam = caffe::GPUDevice<double> (1 ms)
[ RUN ] NeuronLayerTest/3.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.091322861619979268, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1,
estimated_gradient evaluates to 0.90867713838002073, and
threshold_ * scale evaluates to 0.001.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18315754813835755; objective+ = 1.220623351064388; objective- = 1.2024498082967876
<again snipping many repeats...>
[ FAILED ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double> (63 ms)
[ RUN ] NetTest/0.TestReshape
Segmentation fault (core dumped)
You can see that there's a core dump there at the end!
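For the two dropout failures above, the reported bound of 1.96 * std_error can be reproduced with the usual binomial standard error, sqrt(p * (1 - p) / count). The sketch below is an approximation of that check, not the test's actual code: the element count of 120 is an assumption inferred from the logged bounds, and the function names are made up for illustration.

```python
import math

COUNT = 120  # assumption: inferred from the logged bounds, not stated in the thread


def dropout_bound(dropout_ratio, count, z=1.96):
    """Return z * std_error for a binomial proportion over `count` elements."""
    std_error = math.sqrt(dropout_ratio * (1.0 - dropout_ratio) / count)
    return z * std_error


def within_bound(empirical_ratio, dropout_ratio, count):
    """True if the observed ratio is within the 95% bound of the target ratio."""
    return abs(empirical_ratio - dropout_ratio) <= dropout_bound(dropout_ratio, count)


for ratio in (0.5, 0.75):
    print(ratio, dropout_bound(ratio, COUNT), within_bound(1.0, ratio, COUNT))
```

An empirical ratio of exactly 1 in both tests would mean that every output element was counted as dropped, which presumably points at the random mask generation on the GPU rather than at ordinary statistical noise.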
Just updated to the hip branch, which doesn't seem to have many meaningful changes over the rocrand branch. The same errors persist. I can also report that MNIST and CaffeNet fail as well, both with core dumps.
@davclark - Thanks for the heads-up. Please open a new ticket for the core dumps, as that appears to be a separate issue.
Issue summary
After running ./build/test/test_all.testbin, the following tests failed:
1) ThresholdLayerTest/3.Test Error Message:
src/caffe/test/test_threshold_layer.cpp:67: Failure
Expected: (bottom_data[i]) > (threshold_), actual: -0.635736 vs 0
src/caffe/test/test_threshold_layer.cpp:67: Failure
Expected: (bottom_data[i]) > (threshold_), actual: -0.363372 vs 0
src/caffe/test/test_threshold_layer.cpp:64: Failure
......
2) RNNLayerTest/2.TestForward Error Message:
src/caffe/test/test_rnn_layer.cpp:156: Failure
Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: -0 vs -0 t = 1; i = 0
src/caffe/test/test_rnn_layer.cpp:156: Failure
Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: -0 vs -0 t = 1; i = 1
src/caffe/test/test_rnn_layer.cpp:156: Failure
Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: 0 vs 0 t = 1; i = 2
......
3) RNNLayerTest/2.TestGradient Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.37764036655426025, which exceeds threshold_ * scale, where computed_gradient evaluates to -0.37764036655426025, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 0.24414941668510437; objective+ = -0; objective- = -0
... ...
4) RNNLayerTest/2.TestGradientNonZeroCont Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.37112760543823242, which exceeds threshold_ * scale, where computed_gradient evaluates to 0.37112760543823242, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18513253331184387; objective+ = -0; objective- = -0
... ....
5) RNNLayerTest/2.TestGradientNonZeroContBufferSize2 Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.14533787965774536, which exceeds threshold_ * scale, where computed_gradient evaluates to -0.14533787965774536, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.17294931411743164; objective+ = 0; objective- = 0
... ...
6) RNNLayerTest/2.TestGradientNonZeroContBufferSize2WithStaticInput Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.14564625918865204, which exceeds threshold_ * scale, where computed_gradient evaluates to 0.14564625918865204, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 0.20095519721508026; objective+ = -0; objective- = -0
... ...
7) RNNLayerTest/3.TestForward Error Message:
MIOpen Error: /data/repo/MIOpen/src/ocl/activ_ocl.cpp:45: Only alpha=1 and beta=0 is supported
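The gradient-check failures in items 3 to 6 all report an estimated gradient of 0 with objective+ and objective- both equal to 0. A minimal sketch of a central-difference gradient check, assuming a step size of 1e-2 and a scale clamped to at least 1 (both assumptions, as are the function names), shows why a forward pass that always returns zero produces exactly this pattern:

```python
def estimate_gradient(objective, x, i, stepsize=1e-2):
    """Central-difference estimate of d(objective)/d(x[i])."""
    x_plus = list(x)
    x_plus[i] += stepsize
    x_minus = list(x)
    x_minus[i] -= stepsize
    return (objective(x_plus) - objective(x_minus)) / (2.0 * stepsize)


def gradient_matches(computed, estimated, threshold=1e-3):
    """Compare the gradients with a tolerance scaled by their magnitude."""
    scale = max(abs(computed), abs(estimated), 1.0)
    return abs(computed - estimated) <= threshold * scale


def broken_forward(x):
    """Stands in for a forward pass whose output is always zero."""
    return 0.0


def working_forward(x):
    """A simple quadratic objective with gradient x[i]."""
    return sum(v * v for v in x) / 2.0


x = [0.24414941668510437, -0.2]
print(estimate_gradient(broken_forward, x, 0))   # objective+ == objective-
print(estimate_gradient(working_forward, x, 0))  # close to x[0]
```

Under these assumptions, a nonzero computed_gradient paired with an estimate of 0 suggests that the backward pass ran but the forward output never changed, which fits the all-zero outputs reported for RNNLayerTest/2.TestForward in item 2.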
Steps to reproduce
Build test_all.testbin according to README.ROCm.md. All of the prerequisite packages have been installed, and LD_LIBRARY_PATH and PATH have been set up.
Your system configuration
GPU: AMD MI25
Operating system: Ubuntu 16.04.3 64bit
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): python 2.7.12
Other: miopen-hip 1.1.4 miopengemm 1.1.5 rocm-libs 1.6.180