BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

[OPENCL]failed tests after make runtest with libdnn enabled #5535

Open yewang0320 opened 7 years ago

yewang0320 commented 7 years ago

Issue summary

I was able to compile the opencl branch of Caffe successfully. When I compile with LibDNN enabled and run make runtest, the following failures occur:

[  FAILED  ] 60 tests, listed below:
[  FAILED  ] LibDNNConvolutionLayerTest/0.TestSimpleConvolutionGroupLibDNN, where TypeParam = f
[  FAILED  ] LibDNNConvolutionLayerTest/0.TestSobelConvolutionLibDNN, where TypeParam = f
[  FAILED  ] LibDNNConvolutionLayerTest/0.TestGradientGroupLibDNN, where TypeParam = f
[  FAILED  ] LibDNNConvolutionLayerTest/0.TestSimpleConvolutionLibDNN, where TypeParam = f
[  FAILED  ] LibDNNConvolutionLayerTest/0.TestGradientLibDNN, where TypeParam = f
[  FAILED  ] LibDNNConvolutionLayerTest/1.TestSimpleConvolutionGroupLibDNN, where TypeParam = d
[  FAILED  ] LibDNNConvolutionLayerTest/1.TestSobelConvolutionLibDNN, where TypeParam = d
[  FAILED  ] LibDNNConvolutionLayerTest/1.TestGradientLibDNN, where TypeParam = d
[  FAILED  ] LibDNNConvolutionLayerTest/1.TestGradientGroupLibDNN, where TypeParam = d
[  FAILED  ] LibDNNConvolutionLayerTest/1.TestSimpleConvolutionLibDNN, where TypeParam = d
[  FAILED  ] LibDNNConvolutionNDLayerTest/0.TestBackward, where TypeParam = f
[  FAILED  ] LibDNNConvolutionNDLayerTest/0.TestForward, where TypeParam = f
[  FAILED  ] LibDNNConvolutionNDLayerTest/1.TestBackward, where TypeParam = d
[  FAILED  ] LibDNNConvolutionNDLayerTest/1.TestForward, where TypeParam = d
[  FAILED  ] LibDNNComparativeConvTest/0.TestBackward, where TypeParam = f
[  FAILED  ] LibDNNComparativeConvTest/0.TestForward, where TypeParam = f
[  FAILED  ] LibDNNComparativeConvTest/1.TestForward, where TypeParam = d
[  FAILED  ] LibDNNComparativeConvTest/1.TestBackward, where TypeParam = d
[  FAILED  ] LibDNNDeconvolutionLayerTest/0.TestGradient3D, where TypeParam = f
[  FAILED  ] LibDNNDeconvolutionLayerTest/0.TestSimpleDeconvolution, where TypeParam = f
[  FAILED  ] LibDNNDeconvolutionLayerTest/0.TestGradient, where TypeParam = f
[  FAILED  ] LibDNNDeconvolutionLayerTest/1.TestGradient, where TypeParam = d
[  FAILED  ] LibDNNDeconvolutionLayerTest/1.TestGradient3D, where TypeParam = d
[  FAILED  ] LibDNNDeconvolutionLayerTest/1.TestSimpleDeconvolution, where TypeParam = d
[  FAILED  ] LibDNNComparativeDeconvTest/0.TestBackward, where TypeParam = f
[  FAILED  ] LibDNNComparativeDeconvTest/0.TestForward, where TypeParam = f
[  FAILED  ] LibDNNComparativeDeconvTest/1.TestForward, where TypeParam = d
[  FAILED  ] LibDNNComparativeDeconvTest/1.TestBackward, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerTest/0.TestGradientMax, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerTest/0.TestForwardMaxPadded, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerTest/0.TestGradientMaxTopMask, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerTest/0.TestForwardMaxTopMask, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerTest/0.TestGradientAve, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerTest/0.TestForwardAve, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerTest/0.TestGradientAvePadded, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerTest/0.TestForwardMax, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerTest/1.TestGradientAvePadded, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerTest/1.TestForwardMaxTopMask, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerTest/1.TestForwardMaxPadded, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerTest/1.TestForwardAve, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerTest/1.TestForwardMax, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerTest/1.TestGradientMaxTopMask, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerTest/1.TestGradientMax, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerTest/1.TestGradientAve, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerNDTest/0.TestForward, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerNDTest/0.TestBackward, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerNDTest/1.TestBackward, where TypeParam = d
[  FAILED  ] LibDNNPoolingLayerNDTest/1.TestForward, where TypeParam = d
[  FAILED  ] LibDNNComparativePoolTest/0.TestBackward, where TypeParam = f
[  FAILED  ] LibDNNComparativePoolTest/0.TestForward, where TypeParam = f
[  FAILED  ] LibDNNComparativePoolTest/1.TestForward, where TypeParam = d
[  FAILED  ] LibDNNComparativePoolTest/1.TestBackward, where TypeParam = d
[  FAILED  ] LRNLayerTest/2.TestForwardAcrossChannels, where TypeParam = N5caffe9GPUDeviceIfEE
[  FAILED  ] LRNLayerTest/2.TestGradientAcrossChannels, where TypeParam = N5caffe9GPUDeviceIfEE
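To see at a glance which suites are affected, a summary like the one above can be grouped by test suite. The following is a minimal sketch and not part of the original report: the helper name `group_failures` and the shortened `sample` text are made up for illustration, and it assumes the googletest summary line format shown above (`[  FAILED  ] Suite/N.TestName, where TypeParam = T`).

```python
import re
from collections import Counter

# Matches googletest summary lines such as:
# [  FAILED  ] LibDNNPoolingLayerTest/0.TestForwardMax, where TypeParam = f
# The "N tests, listed below:" header does not match because it has no "Suite/N." part.
FAILED_LINE = re.compile(r"^\[\s*FAILED\s*\]\s+(\w+)/\d+\.(\w+)")


def group_failures(log_text):
    """Count failed tests per suite name (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Shortened sample in the same format as the listing above.
sample = """\
[  FAILED  ] 60 tests, listed below:
[  FAILED  ] LibDNNPoolingLayerTest/0.TestForwardMax, where TypeParam = f
[  FAILED  ] LibDNNPoolingLayerTest/1.TestForwardMax, where TypeParam = d
[  FAILED  ] LRNLayerTest/2.TestForwardAcrossChannels, where TypeParam = N5caffe9GPUDeviceIfEE
"""

for suite, count in sorted(group_failures(sample).items()):
    print(suite, count)
```

Applied to the listing above, this kind of grouping would show that every failure except the two LRNLayerTest entries comes from a LibDNN suite.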

I tested the built pycaffe by:

import os
import caffe
USE_GPU = True

if USE_GPU:
    caffe.set_device(0)
    caffe.set_mode_gpu()

else:
    caffe.set_mode_cpu()

#print("Initialized caffe")

solver_file = "examples/mnist/lenet_solver.prototxt"
solver = caffe.SGDSolver(solver_file)

solver.solve()

I get the following error:

ViennaCL: FATAL ERROR: Kernel start failed for 'conv_forward'.
std::exception
Traceback (most recent call last):
  File "caffe_test.py", line 17, in <module>
    solver.solve()
SystemError: NULL result without error in PyObject_Call

Steps to reproduce

My Makefile.config (relevant parts only):

# 32 bit / 64 bit indexing
# USE_INDEX_64 := 1

# GreenTea (ViennaCL/OpenCL) backend switch

# Enable the CUDA backend
# USE_CUDA := 1

# Enable the OpenCL/Greentea backend
USE_GREENTEA := 1
DISABLE_DEVICE_HOST_UNIFIED_MEMORY := 0
# Enable the Greentea-LibDNN convolution backend
USE_LIBDNN := 1

# Folder of the ViennaCL header-only library
VIENNACL_DIR = ../ViennaCL

# CPU OpenMP switch. Do not use OpenMP on dual socket systems!
USE_OPENMP := 1

# BLAS choice:
# atlas for ATLAS (default)
# mkl for MKL
# open for OpenBlas
BLAS := atlas

Your system configuration

Operating system: OSX 10.12
Compiler: clang
CUDA version (if applicable): N/A
CUDNN version (if applicable): N/A
BLAS: ViennaCL (also tried clBLAS, same result)
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7.13

It works with LibDNN disabled, but it is slower than CPU mode. Thank you.

yewang0320 commented 7 years ago

It now miraculously works on my secondary GPU when I call caffe.set_device(1), and it is much faster than CPU-only mode! I am still wondering why the other GPU reports an error. Is it running out of memory?

Also, is there any workaround to make two AMD GPUs work together?

naibaf7 commented 7 years ago

Currently you can only run the GPUs separately; however, some updates are planned. Can you please also show the output of the following command:

./build/tools/caffe device_query

It might be that your GPU number 0 is in fact a CPU with OpenCL support or an onboard GPU that does not support LibDNN.

yewang0320 commented 7 years ago

You are right, Fabian: setting the GPU device to 2 solves the problem. Thank you.
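The resolution in this thread (trying a different device index) can be sketched as a small probe loop. This is a rough illustration under stated assumptions, not the maintainers' recommended procedure: `first_working_device` and `one_solver_step` are hypothetical names, the candidate indices 0 to 2 are taken from this thread, and it assumes a failing device surfaces as a Python exception, as in the traceback above. A fatal OpenCL error could instead abort the whole process, and switching devices within one process may not be supported, so running each probe in a fresh process may be safer.

```python
def first_working_device(device_ids, run_on_device):
    """Return the first device id for which run_on_device succeeds, else None."""
    for device_id in device_ids:
        try:
            run_on_device(device_id)
        except Exception as err:  # e.g. the SystemError from the traceback above
            print("device %d failed: %r" % (device_id, err))
            continue
        return device_id
    return None


def one_solver_step(device_id):
    """Hypothetical probe: run a single solver step on the given device."""
    import caffe  # imported lazily so the helper above works without pycaffe

    caffe.set_device(device_id)
    caffe.set_mode_gpu()
    solver = caffe.SGDSolver("examples/mnist/lenet_solver.prototxt")
    solver.step(1)  # one iteration is enough to launch the convolution kernels


if __name__ == "__main__":
    # Candidate indices are an assumption; check the device_query output first.
    print(first_working_device([0, 1, 2], one_solver_step))
```

Comparing the result with the output of ./build/tools/caffe device_query should show which physical device each index refers to.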