Tensorflow not scaling across multiple cores, slower than pip version.

jmadams1 commented 5 years ago

Actual Behavior

Tensorflow does not seem to scale over two cores. No GPU used.

Few Notes: 1) If there is a different well known example you would like me to test instead, please let me know. The one I've been using for this test may be difficult to setup (but what I've been interested in. Since it works as expected using pip tensorflow, I decided to post the question here.) 2) Everything is out of the box/default. There may be things different about anaconda tensorflow that requires special tunings - I don't know what those are.

Expected Behavior

When running code utilizing tensorflow, it should scale across all cores.

Steps to Reproduce

I've seen this on a few tensorflow examples I've been running. So, I did an experiment with a clean 8 core VM (ubuntu 18.04). I downloaded a well known non trivial example that uses tensorflow: https://github.com/google-research/exoplanet-ml (Downloaded code and pre-computed training data.)

Commands executed to create environment: conda create -n kepler python=3.6 anaconda conda activate kepler conda install tensorflow pandas numpy scipy astropy absl-py conda install -c astropy pydl conda install bazel==0.18.0 conda install -c hcc tensorflow-probability

Followed the install instructions, but used Anaconda packages. Made a change to one file to be compatible with python3 (./astrowavenet/data/base.py) replaced iteritems with items.

When I executed the training example, the code never used more than two cores.

Execution time: real 40m27.855s user 54m59.402s sys 1m4.234s

Did the same experiment, but using all pip versions of the requirements (except bazel, downloaded from bazel site.) Training used all cores.

real 17m55.586s user 107m3.854s sys 13m51.620s

Commands executed: ./bazel-0.18.0-installer-linux-x86_64.sh --user .. sudo apt-get install -y python3-venv python3 -m venv pip_test source pip_test/bin/activate pip install tensorflow pandas numpy scipy astropy pydl absl-py tensorflow_probability

Anaconda or Miniconda version:

2019.3

Operating System:

Ubuntu 18.04.2 64bit (All up to date.)

`conda info`

``` PASTE OUTPUT HERE: active environment : kepler active env location : /home/jmadams/anaconda3/envs/kepler shell level : 2 user config file : /home/jmadams/.condarc populated config files : conda version : 4.6.11 conda-build version : 3.17.8 python version : 3.7.3.final.0 base environment : /home/jmadams/anaconda3 (writable) channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 https://repo.anaconda.com/pkgs/main/noarch https://repo.anaconda.com/pkgs/free/linux-64 https://repo.anaconda.com/pkgs/free/noarch https://repo.anaconda.com/pkgs/r/linux-64 https://repo.anaconda.com/pkgs/r/noarch package cache : /home/jmadams/anaconda3/pkgs /home/jmadams/.conda/pkgs envs directories : /home/jmadams/anaconda3/envs /home/jmadams/.conda/envs platform : linux-64 user-agent : conda/4.6.11 requests/2.21.0 CPython/3.7.3 Linux/4.18.0-17-generic ubuntu/18.04.2 glibc/2.27 UID:GID : 1000:1000 netrc file : None offline mode : False ```

`conda list --show-channel-urls`

``` PASTE OUTPUT HERE: WARNING: The conda.compat module is deprecated and will be removed in a future release. # packages in environment at /home/jmadams/anaconda3/envs/kepler: # # Name Version Build Channel _tflow_select 2.3.0 mkl defaults absl-py 0.7.0 py36_0 defaults alabaster 0.7.12 py36_0 defaults anaconda custom py36hbbc8b67_0 defaults anaconda-client 1.7.2 py36_0 defaults anaconda-project 0.8.2 py36_0 defaults asn1crypto 0.24.0 py36_0 defaults astor 0.7.1 py36_0 defaults astroid 2.2.5 py36_0 defaults astropy 3.1.2 py36h7b6447c_0 defaults atomicwrites 1.3.0 py36_1 defaults attrs 19.1.0 py36_1 defaults babel 2.6.0 py36_0 defaults backcall 0.1.0 py36_0 defaults backports 1.0 py36_1 defaults backports.os 0.1.1 py36_0 defaults backports.shutil_get_terminal_size 1.0.0 py36_2 defaults bazel 0.18.0 he6710b0_0 defaults beautifulsoup4 4.7.1 py36_1 defaults bitarray 0.8.3 py36h14c3975_0 defaults bkcharts 0.2 py36h735825a_0 defaults blas 1.0 mkl defaults bleach 3.1.0 py36_0 defaults blosc 1.15.0 hd408876_0 defaults bokeh 1.0.4 py36_0 defaults boto 2.49.0 py36_0 defaults bottleneck 1.2.1 py36h035aef0_1 defaults bzip2 1.0.6 h14c3975_5 defaults c-ares 1.15.0 h7b6447c_1 defaults ca-certificates 2019.1.23 0 defaults cairo 1.14.12 h8948797_3 defaults certifi 2019.3.9 py36_0 defaults cffi 1.12.2 py36h2e261b9_1 defaults chardet 3.0.4 py36_1 defaults click 7.0 py36_0 defaults cloudpickle 0.8.0 py36_0 defaults clyent 1.2.2 py36_1 defaults colorama 0.4.1 py36_0 defaults contextlib2 0.5.5 py36h6c84a62_0 defaults cryptography 2.6.1 py36h1ba5d50_0 defaults curl 7.64.0 hbc83047_2 defaults cycler 0.10.0 py36h93f1223_0 defaults cython 0.29.6 py36he6710b0_0 defaults cytoolz 0.9.0.1 py36h14c3975_1 defaults dask 1.1.4 py36_1 defaults dask-core 1.1.4 py36_1 defaults dbus 1.13.6 h746ee38_0 defaults decorator 4.4.0 py36_1 defaults defusedxml 0.5.0 py36_1 defaults distributed 1.26.0 py36_1 defaults docutils 0.14 py36hb0f60f5_0 defaults entrypoints 0.3 py36_0 defaults et_xmlfile 1.0.1 py36hd6bccc3_0 defaults expat 2.2.6 he6710b0_0 defaults fastcache 1.0.2 py36h14c3975_2 defaults flask 1.0.2 py36_1 defaults fontconfig 2.13.0 h9420a91_0 defaults freetype 2.9.1 h8a8886c_1 defaults fribidi 1.0.5 h7b6447c_0 defaults gast 0.2.2 py36_0 defaults get_terminal_size 1.0.0 haa9412d_0 defaults gevent 1.4.0 py36h7b6447c_0 defaults glib 2.56.2 hd408876_0 defaults gmp 6.1.2 h6c8ec71_1 defaults gmpy2 2.0.8 py36h10f8cd9_2 defaults graphite2 1.3.13 h23475e2_0 defaults greenlet 0.4.15 py36h7b6447c_0 defaults grpcio 1.16.1 py36hf8bcb03_1 defaults gst-plugins-base 1.14.0 hbbd80ab_1 defaults gstreamer 1.14.0 hb453b48_1 defaults h5py 2.9.0 py36h7918eee_0 defaults harfbuzz 1.8.8 hffaf4a1_0 defaults hdf5 1.10.4 hb1b8bf9_0 defaults heapdict 1.0.0 py36_2 defaults html5lib 1.0.1 py36_0 defaults icu 58.2 h9c2bf20_1 defaults idna 2.8 py36_0 defaults imageio 2.5.0 py36_0 defaults imagesize 1.1.0 py36_0 defaults importlib_metadata 0.8 py36_0 defaults intel-openmp 2019.3 199 defaults ipykernel 5.1.0 py36h39e3cac_0 defaults ipython 7.4.0 py36h39e3cac_0 defaults ipython_genutils 0.2.0 py36hb52b0d5_0 defaults ipywidgets 7.4.2 py36_0 defaults isort 4.3.16 py36_0 defaults itsdangerous 1.1.0 py36_0 defaults jbig 2.1 hdba287a_0 defaults jdcal 1.4 py36_0 defaults jedi 0.13.3 py36_0 defaults jeepney 0.4 py36_0 defaults jinja2 2.10 py36_0 defaults jpeg 9b h024ee3a_2 defaults jsonschema 3.0.1 py36_0 defaults jupyter 1.0.0 py36_7 defaults jupyter_client 5.2.4 py36_0 defaults jupyter_console 6.0.0 py36_0 defaults jupyter_core 4.4.0 py36_0 defaults jupyterlab 0.35.4 py36hf63ae98_0 defaults jupyterlab_server 0.2.0 py36_0 defaults keras-applications 1.0.7 py_0 defaults keras-preprocessing 1.0.9 py_0 defaults keyring 18.0.0 py36_0 defaults kiwisolver 1.0.1 py36hf484d3e_0 defaults krb5 1.16.1 h173b8e3_7 defaults lazy-object-proxy 1.3.1 py36h14c3975_2 defaults libcurl 7.64.0 h20c2e04_2 defaults libedit 3.1.20181209 hc058e9b_0 defaults libffi 3.2.1 hd88cf55_4 defaults libgcc-ng 8.2.0 hdf63c60_1 defaults libgfortran-ng 7.3.0 hdf63c60_0 defaults libpng 1.6.36 hbc83047_0 defaults libprotobuf 3.6.1 hd408876_0 defaults libsodium 1.0.16 h1bed415_0 defaults libssh2 1.8.0 h1ba5d50_4 defaults libstdcxx-ng 8.2.0 hdf63c60_1 defaults libtiff 4.0.10 h2733197_2 defaults libtool 2.4.6 h7b6447c_5 defaults libuuid 1.0.3 h1bed415_2 defaults libxcb 1.13 h1bed415_1 defaults libxml2 2.9.9 he19cac6_0 defaults libxslt 1.1.33 h7d1a2b0_0 defaults llvmlite 0.28.0 py36hd408876_0 defaults locket 0.2.0 py36h787c0ad_1 defaults lxml 4.3.2 py36hefd8a0e_0 defaults lzo 2.10 h49e0be7_2 defaults markdown 3.0.1 py36_0 defaults markupsafe 1.1.1 py36h7b6447c_0 defaults matplotlib 3.0.3 py36h5429711_0 defaults mccabe 0.6.1 py36_1 defaults mistune 0.8.4 py36h7b6447c_0 defaults mkl 2019.3 199 defaults mkl-service 1.1.2 py36he904b0f_5 defaults mkl_fft 1.0.10 py36ha843d7b_0 defaults mkl_random 1.0.2 py36hd81dba3_0 defaults mock 2.0.0 py36_0 defaults more-itertools 6.0.0 py36_0 defaults mpc 1.1.0 h10f8cd9_1 defaults mpfr 4.0.1 hdf1c602_3 defaults mpmath 1.1.0 py36_0 defaults msgpack-python 0.6.1 py36hfd86e86_1 defaults multipledispatch 0.6.0 py36_0 defaults nbconvert 5.4.1 py36_3 defaults nbformat 4.4.0 py36h31c9010_0 defaults ncurses 6.1 he6710b0_1 defaults networkx 2.2 py36_1 defaults nltk 3.4 py36_1 defaults nose 1.3.7 py36_2 defaults notebook 5.7.8 py36_0 defaults numba 0.43.1 py36h962f231_0 defaults numexpr 2.6.9 py36h9e4a6bb_0 defaults numpy 1.15.4 py36h7e9f1db_0 defaults numpy-base 1.15.4 py36hde5b4d6_0 defaults numpydoc 0.8.0 py36_0 defaults olefile 0.46 py36_0 defaults openjdk 8.0.152 h46b5887_1 defaults openpyxl 2.6.1 py36_1 defaults openssl 1.1.1b h7b6447c_1 defaults packaging 19.0 py36_0 defaults pandas 0.24.2 py36he6710b0_0 defaults pandoc 2.2.3.2 0 defaults pandocfilters 1.4.2 py36_1 defaults pango 1.42.4 h049681c_0 defaults parso 0.3.4 py36_0 defaults partd 0.3.10 py36_1 defaults path.py 11.5.0 py36_0 defaults pathlib2 2.3.3 py36_0 defaults patsy 0.5.1 py36_0 defaults pbr 5.1.3 py_0 defaults pcre 8.43 he6710b0_0 defaults pep8 1.7.1 py36_0 defaults pexpect 4.6.0 py36_0 defaults pickleshare 0.7.5 py36_0 defaults pillow 5.4.1 py36h34e0f95_0 defaults pip 19.0.3 py36_0 defaults pixman 0.38.0 h7b6447c_0 defaults pluggy 0.9.0 py36_0 defaults ply 3.11 py36_0 defaults prometheus_client 0.6.0 py36_0 defaults prompt_toolkit 2.0.9 py36_0 defaults protobuf 3.6.1 py36he6710b0_0 defaults psutil 5.6.1 py36h7b6447c_0 defaults ptyprocess 0.6.0 py36_0 defaults py 1.8.0 py36_0 defaults pycodestyle 2.5.0 py36_0 defaults pycosat 0.6.3 py36h14c3975_0 defaults pycparser 2.19 py36_0 defaults pycrypto 2.6.1 py36h14c3975_9 defaults pycurl 7.43.0.2 py36h1ba5d50_0 defaults pydl 0.7.0 py36_0 astropy pyflakes 2.1.1 py36_0 defaults pygments 2.3.1 py36_0 defaults pylint 2.3.1 py36_0 defaults pyodbc 4.0.26 py36he6710b0_0 defaults pyopenssl 19.0.0 py36_0 defaults pyparsing 2.3.1 py36_0 defaults pyqt 5.9.2 py36h05f1152_2 defaults pyrsistent 0.14.11 py36h7b6447c_0 defaults pysocks 1.6.8 py36_0 defaults pytables 3.5.1 py36h71ec239_0 defaults pytest 4.3.1 py36_0 defaults pytest-arraydiff 0.3 py36h39e3cac_0 defaults pytest-astropy 0.5.0 py36_0 defaults pytest-doctestplus 0.3.0 py36_0 defaults pytest-openfiles 0.3.2 py36_0 defaults pytest-remotedata 0.3.1 py36_0 defaults python 3.6.8 h0371630_0 defaults python-dateutil 2.8.0 py36_0 defaults pytz 2018.9 py36_0 defaults pywavelets 1.0.2 py36hdd07704_0 defaults pyyaml 5.1 py36h7b6447c_0 defaults pyzmq 18.0.0 py36he6710b0_0 defaults qt 5.9.7 h5867ecd_1 defaults qtawesome 0.5.7 py36_1 defaults qtconsole 4.4.3 py36_0 defaults qtpy 1.7.0 py36_1 defaults readline 7.0 h7b6447c_5 defaults requests 2.21.0 py36_0 defaults rope 0.12.0 py36_0 defaults ruamel_yaml 0.15.46 py36h14c3975_0 defaults scikit-image 0.14.2 py36he6710b0_0 defaults scikit-learn 0.20.3 py36hd81dba3_0 defaults scipy 1.2.1 py36h7c811a0_0 defaults seaborn 0.9.0 py36_0 defaults secretstorage 3.1.1 py36_0 defaults send2trash 1.5.0 py36_0 defaults setuptools 40.8.0 py36_0 defaults simplegeneric 0.8.1 py36_2 defaults singledispatch 3.4.0.3 py36h7a266c3_0 defaults sip 4.19.8 py36hf484d3e_0 defaults six 1.11.0 py36_1 defaults snappy 1.1.7 hbae5bb6_3 defaults snowballstemmer 1.2.1 py36h6febd40_0 defaults sortedcollections 1.1.2 py36_0 defaults sortedcontainers 2.1.0 py36_0 defaults soupsieve 1.8 py36_0 defaults sphinx 1.8.5 py36_0 defaults sphinxcontrib 1.0 py36_1 defaults sphinxcontrib-websupport 1.1.0 py36_1 defaults spyder 3.3.3 py36_0 defaults spyder-kernels 0.4.2 py36_0 defaults sqlalchemy 1.3.1 py36h7b6447c_0 defaults sqlite 3.27.2 h7b6447c_0 defaults statsmodels 0.9.0 py36h035aef0_0 defaults sympy 1.3 py36_0 defaults tblib 1.3.2 py36h34cf8b6_0 defaults tensorboard 1.13.1 py36hf484d3e_0 defaults tensorflow 1.13.1 mkl_py36h27d456a_0 defaults tensorflow-base 1.13.1 mkl_py36h7ce6ba3_0 defaults tensorflow-estimator 1.13.0 py_0 defaults tensorflow-probability 0.5.0 py36h1987d90_0 hcc termcolor 1.1.0 py36_1 defaults terminado 0.8.1 py36_1 defaults testpath 0.4.2 py36_0 defaults tk 8.6.8 hbc83047_0 defaults toolz 0.9.0 py36_0 defaults tornado 6.0.2 py36h7b6447c_0 defaults traitlets 4.3.2 py36h674d592_0 defaults typed-ast 1.3.1 py36h7b6447c_0 defaults unicodecsv 0.14.1 py36ha668878_0 defaults unixodbc 2.3.7 h14c3975_0 defaults urllib3 1.24.1 py36_0 defaults wcwidth 0.1.7 py36hdf4376a_0 defaults webencodings 0.5.1 py36_1 defaults werkzeug 0.14.1 py36_0 defaults wheel 0.33.1 py36_0 defaults widgetsnbextension 3.4.2 py36_0 defaults wrapt 1.11.1 py36h7b6447c_0 defaults wurlitzer 1.0.2 py36_0 defaults xlrd 1.2.0 py36_0 defaults xlsxwriter 1.1.5 py36_0 defaults xlwt 1.3.0 py36h7b00a1f_0 defaults xz 5.2.4 h14c3975_4 defaults yaml 0.1.7 had09818_2 defaults zeromq 4.3.1 he6710b0_3 defaults zict 0.1.4 py36_0 defaults zipp 0.3.3 py36_1 defaults zlib 1.2.11 h7b6447c_3 defaults zstd 1.3.7 h0b5b093_0 defaults ```

jjhelmus commented 5 years ago

@jmadams1 The default tensorflow package installed when you run conda install tensorflow is build with MKL-DNN support. This should provide improved performance on CPU based workflow but uses different environment variables to tune for optimal performance than variant without MKL-DNN . The Tensorflow documentation on the topic discusses these variables and has recommendations on how these should be set. It is possible that the exoplanet-ml benchmark is setting some of these variables to non-optimal values.

I've tested the Tensorflow benchmarks with the default tensorflow conda package and get better performance compared to the pip installed package. Testing these benchmarks to see if you see poor scaling would be a good test.

If you still see issue the tensorflow-eigen package is build without MKL-DNN. Testing with this package may provide some insights.

jmadams1 commented 5 years ago

I'm now working through the following example: https://github.com/tensorflow/models/tree/master/official/resnet

I'm not done with all the testing yet, but I'm getting the same results so far. I've been running the examples in vmware player on an AMD Ryzen 1800x and 2700x. It might be that the Intel optimized tensorflow included with Anaconda acts differently on AMD systems. As part of my testing, I will move the VM to an Intel based system and re-execute the tests. I'll post a followup within the next few days with final results.

jjhelmus commented 5 years ago

@jmadams1 Let me know how the benchmarks go. I am curious as I have not looked much at the performance of the MKL-DNN Tensorflow on AMD systems. It would be interesting to know if the results are different if run on bare-metal rather than in a VM. The VM may interfere with the ability to detect the CPU extensions available which could result in slow code paths being selected.

jmadams1 commented 5 years ago

Summary of tests: In the most recent versions of Anaconda, out of the box Anaconda Tensorflow+Python doesn't scale across multiple cores well like Google's and Intel's releases. It appears to have begun in Anaconda 3.6 and persists in Anaconda 3.7. Anaconda 3.5 appears to function as expected. No platform specific issue found between Intel and AMD when running Anaconda.

I changed the test from resnet to mnist (https://github.com/tensorflow/models/tree/master/official/mnist), since mnist has no external dependencies except for requests. It also runs much faster, so I could complete test runs in a reasonable length of time. I moved testing to a physical Intel system, (Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz,) to eliminate the AMD question. It's a 6 core, 12 thread workstation running Ubuntu 16.04.

Please take a look at the results below. The issue happens in Anaconda Python 3.6 and 3.7. It seems to run as intended on Anaconda Tensorflow & Anaconda Python 3.5, although it isn't faster than plain tensorflow. In Anaconda python 3.6 & 3.7, it takes much longer to provide a result. In absolute cpu time, Anaconda python 3.6 and 3.7 is more efficient, using less user and system time - but I don't think that's what you're after unless you're optimizing for dual core laptops. After the Intel results, I also included Ryzen. I've attached CPU graphs to illustrate the difference in CPU utilization for each run (please note the legend, the order of tests was not the same between the two systems.)

Pip Installed Tensorflow, OS based Python 3.5: real 35m45.979s user 284m3.763s sys 15m13.145s

Pip Installed Intel_Tensorflow, OS based Python 3.5: real 59m31.725s user 435m10.009s sys 183m43.987s

Anaconda Tensorflow, Anaconda Python 3.5: real 39m4.598s user 295m39.062s sys 39m8.744s

Anaconda Tensorflow, Anaconda Python 3.6: real 96m40.645s user 154m0.565s sys 1m16.992s

Anaconda Tensorflow, Anaconda Python 3.7: real 101m3.908s user 155m27.237s sys 4m24.360s

As a comparison, I also ran on Ubuntu 18.04 using a Ryzen 2700x, 8 cpu cores, 16 threads. OS based Python is version 3.6.

Pip installed Tensorflow, OS based Python 3.6: real 33m24.808s user 358m38.235s sys 13m35.111s

*Pip Installed Intel_Tensorflow, OS based python 3.6: real 56m43.107s user 350m38.242s sys 485m55.180s

Anaconda Tensorflow, Anaconda Python 3.5: real 28m18.713s user 247m37.472s sys 73m25.166s

Anaconda Tensorflow, Anaconda Python 3.6: real 69m56.112s user 100m1.557s sys 0m27.783s

Anaconda Tensorflow, Anaconda Python 3.7: real 70m6.803s user 97m47.096s sys 0m27.165s

*Pip Installed Intel_Tensorflow mis-identified the number of cores and ran double the amount of threads per core than it did on the Intel system.

Intel_CPU Ryzen_CPU

jmadams1 commented 5 years ago

Ping. Any thoughts on this?

Thanks, John

mbauroth01 commented 5 years ago

Funnywise I came to this topic from another suggestion using tensorflow-mkl from conda over pip. Same results as TO ... only 33% CPU usage on all 4 cores (8 threads) with tensorflow-mkl upto 100% CPU usage on all 4 cores (8 threads) with tensorflow-eigen

The time to complete the learning of the model took half(!) the time with tensorflow-eigen! I use latest miniconda 64bit with python 3.7

janosh commented 4 years ago

Same here. Just followed the advice here and here and ended up with a TF that's much slower than the pip-installed version.

jjhelmus commented 4 years ago

When using the MKL variant of Tensorflow it may be necessary to set some environment variables for best performance. The Tensowflow 1.x guide has a section on this topic. From that guide the key environment variables are: KMP_BLOCKTIME=0 and KMP_AFFINITY=granularity=fine,verbose,compact,1,0 with the inter_op_parallelism_threads config attribute set to the number of physical CPUs.

mehdirezaie commented 4 years ago

Same here. Tensorflow installed with Conda is slower than (3x) the one installed with pip.

ContinuumIO / anaconda-issues