Tensorflow not scaling across multiple cores, slower than pip version. #10853

jmadams1 opened 5 years ago

jmadams1 commented 5 years ago

Actual Behavior

Tensorflow does not seem to scale over two cores. No GPU used.

A few notes: 1) If there is a different well-known example you would like me to test instead, please let me know. The one I've been using for this test may be difficult to set up, but it is the one I've been interested in. Since it works as expected using pip tensorflow, I decided to post the question here. 2) Everything is out of the box/default. There may be things about Anaconda tensorflow that require special tuning; I don't know what those are.

Expected Behavior

When running code utilizing tensorflow, it should scale across all cores.

Steps to Reproduce

I've seen this on a few tensorflow examples I've been running. So, I did an experiment with a clean 8-core VM (Ubuntu 18.04). I downloaded a well-known, non-trivial example that uses tensorflow: (Downloaded code and pre-computed training data.)

Commands executed to create environment:
conda create -n kepler python=3.6 anaconda
conda activate kepler
conda install tensorflow pandas numpy scipy astropy absl-py
conda install -c astropy pydl
conda install bazel==0.18.0
conda install -c hcc tensorflow-probability

Followed the install instructions, but used Anaconda packages. Made a change to one file to be compatible with Python 3 (in ./astrowavenet/data/, replaced iteritems with items).
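For anyone hitting the same Python 3 incompatibility, here is a minimal sketch of the kind of change described above. The dictionary and its keys are made up for illustration and are not taken from the astrowavenet code.

```python
# Hypothetical data; the real astrowavenet dictionaries are not shown in this thread.
light_curve = {"time": [0.0, 0.5, 1.0], "flux": [1.00, 0.98, 1.01]}

# Python 2 only: dict.iteritems() was removed in Python 3.
# for name, values in light_curve.iteritems(): ...

# Works on both Python 2 and Python 3: dict.items().
lengths = {name: len(values) for name, values in light_curve.items()}
print(lengths)
```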

When I executed the training example, the code never used more than two cores.

Execution time: real 40m27.855s user 54m59.402s sys 1m4.234s

Did the same experiment, but using all pip versions of the requirements (except bazel, downloaded from bazel site.) Training used all cores.

real 17m55.586s user 107m3.854s sys 13m51.620s

Commands executed:
./ --user ..
sudo apt-get install -y python3-venv
python3 -m venv pip_test
source pip_test/bin/activate
pip install tensorflow pandas numpy scipy astropy pydl absl-py tensorflow_probability

Anaconda or Miniconda version:


Operating System:

Ubuntu 18.04.2 64bit (All up to date.)

conda info
jjhelmus commented 5 years ago

@jmadams1 The default tensorflow package installed when you run conda install tensorflow is built with MKL-DNN support. This should provide improved performance on CPU-based workflows, but it uses different environment variables to tune for optimal performance than the variant without MKL-DNN. The Tensorflow documentation on the topic discusses these variables and has recommendations on how they should be set. It is possible that the exoplanet-ml benchmark is setting some of these variables to non-optimal values.

I've tested the Tensorflow benchmarks with the default tensorflow conda package and get better performance compared to the pip-installed package. Running these benchmarks to check whether you also see poor scaling would be a good test.

If you still see the issue, the tensorflow-eigen package is built without MKL-DNN. Testing with this package may provide some insights.

jmadams1 commented 5 years ago

I'm now working through the following example:

I'm not done with all the testing yet, but I'm getting the same results so far. I've been running the examples in vmware player on an AMD Ryzen 1800x and 2700x. It might be that the Intel optimized tensorflow included with Anaconda acts differently on AMD systems. As part of my testing, I will move the VM to an Intel based system and re-execute the tests. I'll post a followup within the next few days with final results.

jjhelmus commented 5 years ago

@jmadams1 Let me know how the benchmarks go. I am curious as I have not looked much at the performance of the MKL-DNN Tensorflow on AMD systems. It would be interesting to know if the results are different if run on bare-metal rather than in a VM. The VM may interfere with the ability to detect the CPU extensions available which could result in slow code paths being selected.
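One rough way to compare what the guest and the bare-metal host expose is to read the CPU flags the Linux kernel reports. The sketch below assumes a Linux system with /proc/cpuinfo; the list of extensions is chosen for illustration and is not taken from MKL-DNN's own detection logic, which may work differently.

```python
import os

# Instruction-set extensions commonly used by optimized math kernels.
# This list is an assumption for illustration, not MKL-DNN's actual checks.
EXTENSIONS_OF_INTEREST = ("sse4_1", "sse4_2", "avx", "avx2", "fma", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def visible_extensions(path="/proc/cpuinfo"):
    """Map each extension of interest to whether the OS reports it (Linux only)."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        flags = parse_cpu_flags(handle.read())
    return {name: name in flags for name in EXTENSIONS_OF_INTEREST}


if __name__ == "__main__":
    for name, present in visible_extensions().items():
        print(f"{name:10s} {'yes' if present else 'no'}")
```

Running it once inside the VM and once on the host and comparing the two listings would show whether the hypervisor hides any of these extensions.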

jmadams1 commented 5 years ago

Summary of tests: In the most recent versions of Anaconda, the out-of-the-box Anaconda Tensorflow + Python does not scale across multiple cores as well as Google's and Intel's releases. The problem appears to have begun with Anaconda Python 3.6 and persists in Anaconda Python 3.7. Anaconda Python 3.5 appears to function as expected. No platform-specific issue was found between Intel and AMD when running Anaconda.

I changed the test from resnet to mnist, since mnist has no external dependencies except for requests. It also runs much faster, so I could complete test runs in a reasonable length of time. I moved testing to a physical Intel system (Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz) to eliminate the AMD question. It's a 6-core, 12-thread workstation running Ubuntu 16.04.

Please take a look at the results below. The issue happens in Anaconda Python 3.6 and 3.7. It seems to run as intended on Anaconda Tensorflow & Anaconda Python 3.5, although it isn't faster than plain tensorflow. In Anaconda Python 3.6 & 3.7, it takes much longer to provide a result. In absolute CPU time, Anaconda Python 3.6 and 3.7 are more efficient, using less user and system time, but I don't think that's what you're after unless you're optimizing for dual-core laptops. After the Intel results, I also included Ryzen. I've attached CPU graphs to illustrate the difference in CPU utilization for each run (please note the legend; the order of tests was not the same between the two systems).

Pip Installed Tensorflow, OS based Python 3.5: real 35m45.979s user 284m3.763s sys 15m13.145s

Pip Installed Intel_Tensorflow, OS based Python 3.5: real 59m31.725s user 435m10.009s sys 183m43.987s

Anaconda Tensorflow, Anaconda Python 3.5: real 39m4.598s user 295m39.062s sys 39m8.744s

Anaconda Tensorflow, Anaconda Python 3.6: real 96m40.645s user 154m0.565s sys 1m16.992s

Anaconda Tensorflow, Anaconda Python 3.7: real 101m3.908s user 155m27.237s sys 4m24.360s

As a comparison, I also ran on Ubuntu 18.04 using a Ryzen 2700x, 8 cpu cores, 16 threads. OS based Python is version 3.6.

Pip installed Tensorflow, OS based Python 3.6: real 33m24.808s user 358m38.235s sys 13m35.111s

*Pip Installed Intel_Tensorflow, OS based python 3.6: real 56m43.107s user 350m38.242s sys 485m55.180s

Anaconda Tensorflow, Anaconda Python 3.5: real 28m18.713s user 247m37.472s sys 73m25.166s

Anaconda Tensorflow, Anaconda Python 3.6: real 69m56.112s user 100m1.557s sys 0m27.783s

Anaconda Tensorflow, Anaconda Python 3.7: real 70m6.803s user 97m47.096s sys 0m27.165s

*Pip Installed Intel_Tensorflow misidentified the number of cores and ran twice as many threads per core as it did on the Intel system.
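The timings above can be turned into a rough "average busy cores" figure by dividing CPU time (user + sys) by wall-clock time (real). The sketch below does this for the Intel runs; the helper names are made up for illustration, and the ratio is only an approximation of utilization, not a profiler measurement.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '96m40.645s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def average_busy_cores(real, user, system):
    """CPU time (user + sys) divided by wall-clock time (real)."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)


# (real, user, sys) copied from the Intel Xeon results in this comment.
intel_runs = {
    "Pip Tensorflow, OS Python 3.5": ("35m45.979s", "284m3.763s", "15m13.145s"),
    "Anaconda Tensorflow, Python 3.5": ("39m4.598s", "295m39.062s", "39m8.744s"),
    "Anaconda Tensorflow, Python 3.6": ("96m40.645s", "154m0.565s", "1m16.992s"),
    "Anaconda Tensorflow, Python 3.7": ("101m3.908s", "155m27.237s", "4m24.360s"),
}

for label, (real, user, system) in intel_runs.items():
    print(f"{label}: {average_busy_cores(real, user, system):.2f} cores busy on average")
```

On these numbers the pip build keeps roughly 8.4 hardware threads busy on average, while Anaconda Python 3.6 averages about 1.6, which matches the observation that it never uses more than two cores.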

[Attached CPU utilization graphs: Intel_CPU, Ryzen_CPU]

jmadams1 commented 5 years ago

Ping. Any thoughts on this?

Thanks, John

mbauroth01 commented 5 years ago

Funnily enough, I came to this thread after following another suggestion to use tensorflow-mkl from conda instead of pip. Same results as the original poster: only 33% CPU usage on all 4 cores (8 threads) with tensorflow-mkl, versus up to 100% CPU usage on all 4 cores (8 threads) with tensorflow-eigen.

Training the model took half(!) the time with tensorflow-eigen! I use the latest 64-bit Miniconda with Python 3.7.

janosh commented 4 years ago

Same here. Just followed the advice here and here and ended up with a TF that's much slower than the pip-installed version.

jjhelmus commented 4 years ago

When using the MKL variant of Tensorflow, it may be necessary to set some environment variables for best performance. The Tensorflow 1.x guide has a section on this topic. From that guide, the key environment variables are KMP_BLOCKTIME=0 and KMP_AFFINITY=granularity=fine,verbose,compact,1,0, with the inter_op_parallelism_threads config attribute set to the number of physical CPUs.
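A minimal sketch of applying these settings from Python, assuming a TF 1.x style session config. The helper names and the num_physical_cpus parameter are hypothetical, and setting the variables before Tensorflow is imported is a conservative choice made here rather than something stated in the guide.

```python
import os


def mkl_env_settings():
    """Environment variables quoted in the comment above, as strings."""
    return {
        "KMP_BLOCKTIME": "0",
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
    }


def apply_mkl_env(environ=os.environ):
    """Write the settings into the given mapping (the process environment by default)."""
    environ.update(mkl_env_settings())
    return environ


def make_session_config(num_physical_cpus):
    """Build a ConfigProto with inter_op_parallelism_threads set.

    num_physical_cpus is supplied by the caller (hypothetical parameter);
    this sketch does not try to detect it.
    """
    import tensorflow as tf  # imported late so the variables are already set

    # tf.ConfigProto in TF 1.x; tf.compat.v1.ConfigProto in later releases.
    config_cls = getattr(tf, "ConfigProto", None) or tf.compat.v1.ConfigProto
    return config_cls(inter_op_parallelism_threads=num_physical_cpus)


apply_mkl_env()
print({name: os.environ[name] for name in mkl_env_settings()})
```

The config returned by make_session_config would then be passed to the session, for example tf.Session(config=...) in TF 1.x. Whether these values help on a given machine is something to measure, not assume.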

mehdirezaie commented 4 years ago

Same here. Tensorflow installed with Conda is about 3x slower than the one installed with pip.