Unsuccessful Build on A10-7850K, please help!

thornhale commented 7 years ago

This is a follow-up on a previous message. I am encountering build errors and can't seem to find their source.

I followed the steps that I believe your colleague posted here:

https://www.codeplay.com/portal/03-30-17-setting-up-tensorflow-with-opencl-using-sycl

I deviated from these instructions in the following ways:

I did not execute the following steps:

$ sudo apt-get install linux-image-3.19.0-79-generic linux-image-extra-3.19.0-79-generic linux-headers-3.19.0-79-generic 
$ sudo apt-get remove linux-image-4.2.0-42-generic 
$ sudo update-grub -

I was not sure why it is important to move to that particular kernel, so I did not change the kernel. This is the version of Ubuntu I am using:

Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty

I am using the following kernel, which is part of this standard Ubuntu 14.04.5 build:

3.13.0-116-generic

I used Python 3.5 inside a conda environment instead of Python 2.7.

clinfo gives the following info:

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (1912.5)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    AMD Radeon(TM) R7 Graphics  
  Device Topology:               PCI[ B#0, D#1, F#0 ]
  Max compute units:                 8
  Max work items dimensions:             3
    Max work items[0]:               256
    Max work items[1]:               256
    Max work items[2]:               256
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               720Mhz
  Address bits:                  64
  Max memory allocation:             215482368
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      64
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            16
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      2048
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     No
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                861929472
  Constant buffer size:              65536
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 32768
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              215482368
  Max global variable size:          193934080
  Max global variable preferred total size:  861929472
  Max read/write image args:             64
  Max on device events:              1024
  Queue on device max size:          8388608
  Max on device queues:              1
  Queue on device preferred size:        262144
  SVM capabilities:              
    Coarse grain buffer:             Yes
    Fine grain buffer:               Yes
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        1
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x7f1c77535a18
  Name:                      Spectre
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                1912.5 (VM)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 2.0 AMD-APP (1912.5)
  Extensions:                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_khr_gl_depth_images cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes 

  Device Type:                   CL_DEVICE_TYPE_CPU
  Vendor ID:                     1002h
  Board name:                    
  Max compute units:                 4
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               1024
  Preferred vector width char:           16
  Preferred vector width short:          8
  Preferred vector width int:            4
  Preferred vector width long:           2
  Preferred vector width float:          8
  Preferred vector width double:         4
  Native vector width char:          16
  Native vector width short:             8
  Native vector width int:           4
  Native vector width long:          2
  Native vector width float:             8
  Native vector width double:            4
  Max clock frequency:               3700Mhz
  Address bits:                  64
  Max memory allocation:             2147483648
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      64
  Max image 2D width:                8192
  Max image 2D height:               8192
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            16
  Max size of kernel argument:           4096
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                7182524416
  Constant buffer size:              65536
  Max number of constant args:           8
  Local memory type:                 Global
  Local memory size:                 32768
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              2147483648
  Max global variable size:          1879048192
  Max global variable preferred total size:  1879048192
  Max read/write image args:             64
  Max on device events:              0
  Queue on device max size:          0
  Max on device queues:              0
  Queue on device preferred size:        0
  SVM capabilities:              
    Coarse grain buffer:             No
    Fine grain buffer:               No
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     1
  Error correction support:          0
  Unified memory for Host and Device:        1
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             Yes
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                No
    Profiling :                  No
  Platform ID:                   0x7f1c77535a18
  Name:                      AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
  Vendor:                    AuthenticAMD
  Device OpenCL C version:           OpenCL C 1.2 
  Driver version:                1912.5 (sse2,avx,fma4)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2 AMD-APP (1912.5)
  Extensions:                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event

/usr/local/computecpp/bin/computecpp_info gives the following output:

********************************************************************************

ComputeCpp Info (CE 0.1.3)

********************************************************************************

Toolchain information:

GLIBCXX: 20150426
This version of libstdc++ is supported.

********************************************************************************

Device Info:

Discovered 1 devices matching:
  platform    : <any>
  device type : <any>

--------------------------------------------------------------------------------
Device 0:

  Device is supported                     : UNTESTED - Device not tested on this OS
  CL_DEVICE_NAME                          : Spectre
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 1912.5 (VM)
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU 
********************************************************************************

********************************************************************************

********************************************************************************

I note that the CPU was somehow not detected here, which differs from the tutorial mentioned above.
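To cross-check the two tools, the device list can be pulled out of the clinfo text. This is a minimal sketch, assuming the field layout shown in the clinfo output above (a "Device Type:" line opens each device and a later "Name:" line names it). The helper list_devices and the trimmed SAMPLE string are illustrative only and are not part of clinfo or ComputeCpp.

```python
# Trimmed sample in the same layout as the clinfo output above.
SAMPLE = """\
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Board name:                    AMD Radeon(TM) R7 Graphics
  Platform ID:                   0x7f1c77535a18
  Name:                      Spectre
  Device Type:                   CL_DEVICE_TYPE_CPU
  Board name:
  Name:                      AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
"""


def list_devices(clinfo_text):
    """Return (device_type, name) pairs in the order clinfo reports them."""
    devices = []
    current_type = None
    for line in clinfo_text.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        key, _, value = line.strip().partition(":")
        key = key.strip()
        value = value.strip()
        if key == "Device Type":
            current_type = value
        elif key == "Name" and current_type is not None:
            devices.append((current_type, value))
            current_type = None
    return devices


for device_type, name in list_devices(SAMPLE):
    print(device_type, "->", name)
```

Applied to the full clinfo output, this would list both the GPU (Spectre) and the CPU, which makes the mismatch with the single device reported by computecpp_info easy to see.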

After configuring with default options, I ran the following command:

$ bazel build -c opt --copt=-mavx --copt=-msse4.1 --copt=-msse4.2 --config=sycl //tensorflow/tools/pip_package:build_pip_package --verbose_failures

I am encountering the following error:

INFO: Found 1 target...
INFO: From Executing genrule //tensorflow/cc:array_ops_genrule:
2017-04-18 22:42:10.696714: W tensorflow/core/framework/op_gen_lib.cc:194] Squeeze can't find input squeeze_dims to rename
ERROR: /home/anthonyle/Projects/tensorflow-opencl/tensorflow/core/kernels/BUILD:2616:1: C++ compilation of rule '//tensorflow/core/kernels:pooling_ops' failed: computecpp failed: error executing command 
  (cd /home/anthonyle/.cache/bazel/_bazel_anthonyle/1b4b305bac04d7a568c973de167c2cf3/execroot/tensorflow-opencl && \
  exec env - \
  external/local_config_sycl/crosstool/computecpp -Wall -msse3 -g0 -O2 -DNDEBUG -mavx -msse4.1 -msse4.2 '-std=c++11' -MD -MF bazel-out/local_linux-py3-opt/bin/tensorflow/core/kernels/_objs/pooling_ops/tensorflow/core/kernels/pooling_ops_3d.pic.d '-frandom-seed=bazel-out/local_linux-py3-opt/bin/tensorflow/core/kernels/_objs/pooling_ops/tensorflow/core/kernels/pooling_ops_3d.pic.o' -fPIC -DEIGEN_MPL2_ONLY -DTENSORFLOW_USE_JEMALLOC -iquote . -iquote bazel-out/local_linux-py3-opt/genfiles -iquote external/eigen_archive -iquote bazel-out/local_linux-py3-opt/genfiles/external/eigen_archive -iquote external/bazel_tools -iquote bazel-out/local_linux-py3-opt/genfiles/external/bazel_tools -iquote external/local_config_sycl -iquote bazel-out/local_linux-py3-opt/genfiles/external/local_config_sycl -iquote external/jemalloc -iquote bazel-out/local_linux-py3-opt/genfiles/external/jemalloc -iquote external/protobuf -iquote bazel-out/local_linux-py3-opt/genfiles/external/protobuf -iquote external/gif_archive -iquote bazel-out/local_linux-py3-opt/genfiles/external/gif_archive -iquote external/jpeg -iquote bazel-out/local_linux-py3-opt/genfiles/external/jpeg -iquote external/com_googlesource_code_re2 -iquote bazel-out/local_linux-py3-opt/genfiles/external/com_googlesource_code_re2 -iquote external/farmhash_archive -iquote bazel-out/local_linux-py3-opt/genfiles/external/farmhash_archive -iquote external/highwayhash -iquote bazel-out/local_linux-py3-opt/genfiles/external/highwayhash -iquote external/png_archive -iquote bazel-out/local_linux-py3-opt/genfiles/external/png_archive -iquote external/zlib_archive -iquote bazel-out/local_linux-py3-opt/genfiles/external/zlib_archive -isystem external/eigen_archive -isystem bazel-out/local_linux-py3-opt/genfiles/external/eigen_archive -isystem external/bazel_tools/tools/cpp/gcc3 -isystem external/local_config_sycl/sycl -isystem bazel-out/local_linux-py3-opt/genfiles/external/local_config_sycl/sycl -isystem 
external/local_config_sycl/sycl/include -isystem bazel-out/local_linux-py3-opt/genfiles/external/local_config_sycl/sycl/include -isystem external/jemalloc/include -isystem bazel-out/local_linux-py3-opt/genfiles/external/jemalloc/include -isystem external/protobuf/src -isystem bazel-out/local_linux-py3-opt/genfiles/external/protobuf/src -isystem external/gif_archive/lib -isystem bazel-out/local_linux-py3-opt/genfiles/external/gif_archive/lib -isystem external/farmhash_archive/src -isystem bazel-out/local_linux-py3-opt/genfiles/external/farmhash_archive/src -isystem external/png_archive -isystem bazel-out/local_linux-py3-opt/genfiles/external/png_archive -isystem external/zlib_archive -isystem bazel-out/local_linux-py3-opt/genfiles/external/zlib_archive -DEIGEN_AVOID_STL_ARRAY -Iexternal/gemmlowp -Wno-sign-compare -fno-exceptions -msse3 -pthread -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c tensorflow/core/kernels/pooling_ops_3d.cc -o bazel-out/local_linux-py3-opt/bin/tensorflow/core/kernels/_objs/pooling_ops/tensorflow/core/kernels/pooling_ops_3d.pic.o): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
In file included from tensorflow/core/kernels/pooling_ops_3d.cc:26:
./tensorflow/core/kernels/eigen_pooling.h:354:9: error: cannot compile this builtin function yet
        pequal(p, pset1<Packet>(-Eigen::NumTraits<T>::highest()));
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./tensorflow/core/kernels/eigen_pooling.h:337:22: note: expanded from macro 'pequal'
#define pequal(a, b) _mm256_cmp_ps(a, b, _CMP_EQ_UQ)
                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/computecpp/bin/../lib/clang/3.6.0/include/avxintrin.h:421:11: note: expanded from macro '_mm256_cmp_ps'
  (__m256)__builtin_ia32_cmpps256((__v8sf)__a, (__v8sf)__b, (c)); })
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 20.702s, Critical Path: 20.03s

Did skipping some of the steps outlined above really lead to these errors? What did I do wrong?

Zakor94 commented 7 years ago

You are not doing anything wrong. There is an issue with SYCL and the SIMD instructions (i.e. the -mavx and -msse flags). I am running tests right now to see whether a simple fix is possible that would keep the SIMD instructions for the CPU. Otherwise the only solution would be to remove these flags. I will keep you posted.

Also, I noticed that you are using clang. You should switch to gcc-4.8 (and g++-4.8) at least.
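The fallback mentioned above, removing the SIMD flags, can be applied mechanically to the build command from the original post. This is a minimal sketch, assuming the SIMD options are exactly the arguments starting with --copt=-mavx or --copt=-msse. The helper strip_simd_copts is hypothetical and not part of bazel.

```python
import shlex

# Assumption: these prefixes cover the SIMD flags used in this thread.
SIMD_PREFIXES = ("--copt=-mavx", "--copt=-msse")


def strip_simd_copts(command):
    """Drop --copt=-mavx*/--copt=-msse* arguments from a bazel command line."""
    args = shlex.split(command)
    kept = [arg for arg in args if not arg.startswith(SIMD_PREFIXES)]
    return " ".join(shlex.quote(arg) for arg in kept)


original = (
    "bazel build -c opt --copt=-mavx --copt=-msse4.1 --copt=-msse4.2 "
    "--config=sycl //tensorflow/tools/pip_package:build_pip_package "
    "--verbose_failures"
)
print(strip_simd_copts(original))
```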

Zakor94 commented 7 years ago

Ok this seems to compile and pass the tests just fine. Make sure you first switch to gcc-4.8 (the link you mentioned needs to be updated). If this still does not work, please try to apply my fix that you can find here: https://github.com/lukeiwanski/tensorflow/commit/fabe385ddc791d1aa7e44685281ba11e029ecf9f

thornhale commented 7 years ago

Thank you. That helped with compiling! Compiling now finishes successfully. However, I am not quite sure how to proceed from here to get a working build within a virtual environment. I tried to create a wheel like so:

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Then, I tried:

pip install /tmp/tensorflow_pkg/NAME_OF_WHEEL.whl

Now when I try to just import Tensorflow in Jupyter Notebook, I am getting the following errors:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/tensorflow/python/__init__.py in <module>()
     60     sys.setdlopenflags(_default_dlopen_flags | ctypes.RTLD_GLOBAL)
---> 61     from tensorflow.python import pywrap_tensorflow
     62     sys.setdlopenflags(_default_dlopen_flags)

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>()
     27             return _mod
---> 28     _pywrap_tensorflow = swig_import_helper()
     29     del swig_import_helper

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/tensorflow/python/pywrap_tensorflow.py in swig_import_helper()
     23             try:
---> 24                 _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
     25             finally:

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/imp.py in load_module(name, file, filename, details)
    241         else:
--> 242             return load_dynamic(name, filename, file)
    243     elif type_ == PKG_DIRECTORY:

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/imp.py in load_dynamic(name, path, file)
    341             name=name, loader=loader, origin=path)
--> 342         return _load(spec)
    343 

ImportError: libComputeCpp.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-1-c61832825467> in <module>()
     28 from sklearn.cross_validation import KFold, StratifiedKFold
     29 from sklearn.model_selection import train_test_split
---> 30 from keras.applications import ResNet50, InceptionV3
     31 from keras.models import Sequential, Model
     32 from keras.layers.core import Dense, Dropout, Flatten

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/keras/__init__.py in <module>()
      1 from __future__ import absolute_import
      2 
----> 3 from . import activations
      4 from . import applications
      5 from . import backend

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/keras/activations.py in <module>()
      1 from __future__ import absolute_import
      2 import six
----> 3 from . import backend as K
      4 from .utils.generic_utils import deserialize_keras_object
      5 

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/keras/backend/__init__.py in <module>()
     71 elif _BACKEND == 'tensorflow':
     72     sys.stderr.write('Using TensorFlow backend.\n')
---> 73     from .tensorflow_backend import *
     74 else:
     75     raise ValueError('Unknown backend: ' + str(_BACKEND))

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py in <module>()
----> 1 import tensorflow as tf
      2 from tensorflow.python.training import moving_averages
      3 from tensorflow.python.ops import tensor_array_ops
      4 from tensorflow.python.ops import control_flow_ops
      5 from tensorflow.python.ops import functional_ops

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/tensorflow/__init__.py in <module>()
     22 
     23 # pylint: disable=wildcard-import
---> 24 from tensorflow.python import *
     25 # pylint: enable=wildcard-import
     26 

/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/tensorflow/python/__init__.py in <module>()
     70 for some common reasons and solutions.  Include the entire stack trace
     71 above this error message when asking for help.""" % traceback.format_exc()
---> 72   raise ImportError(msg)
     73 
     74 # Protocol buffers

ImportError: Traceback (most recent call last):
  File "/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/tensorflow/python/__init__.py", line 61, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
    _pywrap_tensorflow = swig_import_helper()
  File "/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/site-packages/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
  File "/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/home/anthonyle/anaconda3/envs/deep_learning_gpu3/lib/python3.5/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libComputeCpp.so: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#import_error

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.
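The final ImportError says the dynamic loader could not find libComputeCpp.so. This is a minimal sketch for checking whether the file sits in any directory listed in LD_LIBRARY_PATH. The helper names are illustrative, and /usr/local/computecpp/lib is an assumed install location inferred from the paths earlier in this thread, not a confirmed one.

```python
import os


def find_shared_library(name, search_dirs):
    """Return the first existing path to `name` in `search_dirs`, else None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def ld_library_dirs(environ=os.environ):
    """Directories the loader is told to search via LD_LIBRARY_PATH."""
    raw = environ.get("LD_LIBRARY_PATH", "")
    return [d for d in raw.split(os.pathsep) if d]


if __name__ == "__main__":
    # Assumed location; adjust to wherever ComputeCpp was unpacked.
    dirs = ld_library_dirs() + ["/usr/local/computecpp/lib"]
    print(find_shared_library("libComputeCpp.so", dirs))
```

The loader also consults the ldconfig cache and any rpath baked into the binary, so a None result here is only a hint about where to look, not proof that the library is missing.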

Am I too ambitious in trying to make a wheel and then pip install it into my virtual environment? I see that in the tutorial your colleague is actually doing this:

$ mkdir _python_build
$ cd _python_build
$ ln -s ../bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/org_tensorflow/* .
$ ln -s ../tensorflow/tools/pip_package/* .

...that is, creating symbolic links in the _python_build folder. I fail to understand how this installs a Python package into a site-packages folder, or makes TensorFlow available to the system. Could you enlighten me on that?

Zakor94 commented 7 years ago

OK, so this is just some incomplete information. If you have a look here: http://deep-beta.co.uk/setting-up-tensorflow-with-opencl-using-sycl/ you can find a very similar guide. The last command you need to run from the _python_build directory is:

$ python setup.py develop

This creates an egg-link in the dist-packages folder that points to the _python_build folder and acts just like a package installed with pip. Note that if you want to launch python and import tensorflow, you will have to be outside the repository directory.
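One way to confirm which copy of a package the interpreter picks up (for example the egg-link target rather than the source checkout) is to ask the import system where it would load the module from. This is a minimal sketch using only the standard library. The helper module_origin is illustrative, and json stands in for tensorflow so the example runs anywhere.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print(module_origin("json"))                # path to the stdlib json package
print(module_origin("no_such_module_xyz"))  # None
```

Run from outside the repository, module_origin("tensorflow") would be expected to point into the _python_build folder if the develop install took effect.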

Also I wasn't able to create a whl like you tried either. Not sure why.

thornhale commented 7 years ago

Hi Zakor,

I thought it would be wise for me to also run the tests as recommended. I am noticing 2 things:

1.) There is a timeout option set in the tutorial, like so: bazel test --config=sycl -k --test_timeout 1600 -- //tensorflow/... -//tensorflow/contrib/... -//tensorflow/java/... -//tensorflow/compiler/...

It's rather high. Are these limits high because the test computations are rather expensive?

2.) A lot of these tests time out on my setup. Is this an indication that the integrated GPU is not powerful enough to finish these tests in time? At the time of writing, the tests have been running for 2 days, and I expect them to finish within one more day. Is this an indication of deeper problems, e.g. that the GPU is not actually being used?

Update:

After compiling the build without errors, I proceeded to compare some performance:

Tensorflow from pip: 1 epoch = ~ 1,700 sec (CPU utilization ~ 350%)
Tensorflow+SSE (4.1+4.2)+AVX+Keras on CIFAR10 dataset: 1 epoch = ~ 1,100 sec (CPU utilization ~ 350%)
Tensorflow+OpenCl: 1 epoch = ~ 11,000 sec (CPU utilization ~ 150%)

This is about 10x worse than what I get with just an optimized TensorFlow compilation. I am not currently seeing the hoped-for performance increase. How can I test whether the GPU is used at all?
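The ratios behind these figures can be checked with a few lines of arithmetic. This is a minimal sketch using the approximate per-epoch times reported above. The dictionary keys and the helper slowdown are labels made up for this example.

```python
# Seconds per epoch as reported above (approximate figures).
epoch_seconds = {
    "pip": 1700,
    "sse_avx": 1100,
    "opencl": 11000,
}


def slowdown(candidate, baseline):
    """How many times slower `candidate` is than `baseline`."""
    return epoch_seconds[candidate] / epoch_seconds[baseline]


print(round(slowdown("opencl", "sse_avx"), 1))  # 10.0
print(round(slowdown("opencl", "pip"), 1))      # 6.5
print(round(slowdown("pip", "sse_avx"), 2))     # 1.55
```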

Zakor94 commented 7 years ago

Hi, I am not sure why we need such a high timeout either. Two days is definitely too much! I usually only test -- //tensorflow/... -//tensorflow/compiler/... which takes less than an hour. Also try adding --local_test_jobs=8. Even when I ran all the tests, I don't remember any timeouts. Maybe it is, as you say, because the integrated GPU is not powerful enough. To make sure, I think you should use an external tool such as aticonfig --odgc --odgt since you have an AMD GPU.
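The narrower test run suggested above can be assembled programmatically. This is a minimal sketch: bazel_test_argv is a hypothetical helper, and keeping --config=sycl and -k from the tutorial command is an assumption, since the reply only names the target patterns and --local_test_jobs=8. Target patterns go after the "--" separator, and a leading "-" marks a pattern as excluded.

```python
def bazel_test_argv(include, exclude, jobs=None, timeout=None):
    """Build an argv list for `bazel test` with excluded target patterns."""
    # Assumption: --config=sycl and -k are carried over from the tutorial.
    argv = ["bazel", "test", "--config=sycl", "-k"]
    if timeout is not None:
        argv += ["--test_timeout", str(timeout)]
    if jobs is not None:
        argv.append("--local_test_jobs={}".format(jobs))
    # Everything after "--" is treated as a target pattern, not an option.
    argv.append("--")
    argv += list(include)
    argv += ["-" + pattern for pattern in exclude]
    return argv


print(" ".join(bazel_test_argv(
    include=["//tensorflow/..."],
    exclude=["//tensorflow/compiler/..."],
    jobs=8,
)))
```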

thornhale commented 7 years ago

Thank you for guiding me along:

I have reinstalled tensorflow-opencl because I specified incorrect paths to the ComputeCpp, g++ and gcc compilers last time. I have also rerun my CIFAR-10 benchmark for tensorflow-opencl with the integrated GPU on my AMD A10-7850K setup. This time I also looked at GPU usage. I was able to verify that the GPU is used at 100% capacity. The CPU usage is still at 150%. The time to process 1 epoch is still about 5,800 seconds.

So this is still about 5X worse than just using optimized compilation flags. These are my general observations and thoughts after having tried out tensorflow-opencl:

1.) On my 4-core setup, the CPU usage is only 150%, which suggests that multithreading is not fully efficient in the OpenCL setup, because without OpenCL all 4 cores get used.

2.) I thought that any matrix computation should be faster on a GPU than on a CPU. I think that with OpenCL 1.2, data still has to be copied from CPU RAM to GPU RAM before any computation can be done, and if the GPU only has 1 GB of RAM there is potentially a lot of copying back and forth. Could this be the cause of the slowness? With OpenCL 2.0, I think one does not have to copy data back and forth, but can just pass pointers. May I ask why you went with OpenCL 1.2 instead of OpenCL 2.0? As it stands, the full potential of APUs cannot currently be exploited.

3.) It appears that OpenCL 1.2 is not a full substitute for handcrafted/optimized assembler libraries at this time.
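To put rough numbers on the memory question in point 2, the GPU limits reported by clinfo can be compared with the size of the CIFAR-10 training set held as 32-bit floats. This is a minimal sketch: the byte counts are copied from the clinfo GPU section earlier in this thread, and the dataset shape (50,000 images of 32x32 pixels with 3 channels) is the standard CIFAR-10 training split. It says nothing about how TensorFlow or Keras actually batches its transfers, so treat it as an order-of-magnitude check only.

```python
MIB = 1024 * 1024

# Byte counts copied from the clinfo GPU section earlier in this thread.
GPU_GLOBAL_MEM = 861929472   # "Global memory size"
GPU_MAX_ALLOC = 215482368    # "Max memory allocation"


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


def dataset_bytes(n_images=50000, height=32, width=32, channels=3,
                  bytes_per_value=4):
    """Size of an image set stored densely (float32 by default)."""
    return n_images * height * width * channels * bytes_per_value


print(to_mib(GPU_GLOBAL_MEM))           # 822.0
print(to_mib(GPU_MAX_ALLOC))            # 205.5
print(dataset_bytes())                  # 614400000
print(dataset_bytes() > GPU_MAX_ALLOC)  # True
```

So the device exposes about 822 MiB rather than a full 1 GB, and the float32 training set alone would not fit in a single maximum-size allocation.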

How can I further reduce computation times without NVIDIA GPUs? I still have a few months before I commit. One of the things I am waiting for is the release of the Vega GPUs. In the absence of a software framework on top to exploit the GPU's potential, it will be hard to go with the Vega cards though. For one, the talked-about ROCm and MIOpen initiatives have not been released.

In general what are your thoughts?

(Oh and by the way, if you need help with benchmarking on an APU system, now that things are working for me, I would be happy to help out!)

Zakor94 commented 7 years ago

1) When the GPU is enabled, whether with CUDA or OpenCL, all the heavy work is done there, so the CPU won't be used much. The optimization flags actually don't affect the CPU usage. So a usage of around 100% is expected.

2) I think your VRAM is definitely the bottleneck here, as you suspected. For comparison, I have 4 GB. It is true that a lot of copies happen between the CPU and GPU. Some work still needs to be done to avoid that. Your other question is really about SYCL; you can probably get more help here: https://github.com/lukeiwanski/tensorflow.

Well assuming you don't want to spend more money on GPU, the only possibility I see is contributing to this repository ;) There are other optimizations to do to avoid copies.

(ok that's very nice of you ^^)

thornhale commented 7 years ago

Well, I will spend more money on a GPU in about 2-5 months. The question then will be which GPU to get (NVIDIA vs AMD). At this point, the answer is tilting toward NVIDIA. But I really want to give AMD a good chance first. With MIOpen and ROCm, and this, the most robust path on AMD GPUs seems less well defined.

This probably goes outside the scope of this discussion...in which case, could you point me to lists of optimizations that still need to be done?

Zakor94 commented 7 years ago

Yes this is definitely getting out of scope. Please open an issue on https://github.com/lukeiwanski/tensorflow about the optimizations that can be done to avoid copies.

benoitsteiner / tensorflow-opencl

Unsuccessful Build on A10-7850K, please help! #65