fo40225 / tensorflow-windows-wheel

Tensorflow prebuilt binary for Windows

Performance improvement is not achieved with AVX #8

Open sharon-k opened 6 years ago

sharon-k commented 6 years ago

My processor, an Intel(R) Core(TM) i7-3740QM, supports the AVX instruction set. I created 2 environments with Anaconda 4.5.0:

I ran the same code, evaluating a simple network with two fully connected layers, in each of the envs. I couldn't see any difference in run time between the two. I then tried a more complicated network with a few conv layers, and there was no improvement there either.

Did I miss something? Thank you for your help.
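For reference, a comparison like this is sensitive to how the timing is done. The following is a minimal sketch, not the code used above, of a harness that discards warm-up runs and reports the median. The names `time_inference` and `dummy_forward_pass` are hypothetical; the dummy function stands in for the real network evaluation.

```python
import statistics
import time


def time_inference(run_once, warmup=5, repeats=30):
    """Return the median wall-clock time of run_once() in milliseconds."""
    # Warm-up runs absorb one-time costs such as graph setup and allocation.
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to occasional slow outliers than the mean.
    return statistics.median(samples)


def dummy_forward_pass():
    # Hypothetical stand-in for the real evaluation call in each environment.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"median: {time_inference(dummy_forward_pass):.3f} ms")
```

Running the same script in both environments, with the dummy function replaced by the actual evaluation, gives two medians that can be compared directly.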

fo40225 commented 6 years ago

Indeed.

Currently, when TensorFlow is built on Windows with CMake, the SIMD configuration has an issue: it is not applied to the submodules.

You can use this script to compare the speed.

https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks

In my observation, using AVX makes only a slight difference on Windows, and the results do not reach the improvement shown in the following table.

https://www.tensorflow.org/performance/performance_guide#comparing_compiler_optimizations

sharon-k commented 6 years ago

Thank you for your reply!

Are there plans to address this issue with cmake? Is there an open bug report? If not, could you briefly describe it? Maybe I'd be able to help... :)

fo40225 commented 6 years ago

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake/external

Those cmake files don't pass tensorflow_WIN_CPU_SIMD_OPTIONS when building the external libraries.
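One way to see which files are affected is to scan a local checkout for `.cmake` files that never mention the variable. This is a rough sketch under the assumption that the repository is checked out locally at the path shown; the helper `files_missing_variable` is hypothetical. A file that does not mention the variable is not necessarily wrong (the library may not be compiled at all), so the output is only a list of candidates to inspect.

```python
import sys
from pathlib import Path

SIMD_VARIABLE = "tensorflow_WIN_CPU_SIMD_OPTIONS"


def files_missing_variable(directory, variable=SIMD_VARIABLE):
    """Return sorted names of *.cmake files in directory that never mention variable."""
    missing = []
    for path in sorted(Path(directory).glob("*.cmake")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if variable not in text:
            missing.append(path.name)
    return missing


if __name__ == "__main__":
    # Assumed default: run from the root of a local TensorFlow checkout.
    target = sys.argv[1] if len(sys.argv) > 1 else "tensorflow/contrib/cmake/external"
    for name in files_missing_variable(target):
        print(name)
```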

fo40225 commented 6 years ago

https://github.com/fo40225/tensorflow/commit/0a95e35a1476278e834b921e06d159a4ae7d64c1

I tried building with those changes, but there is still no performance difference between sse2 and avx2.

It seems this issue has another cause.

TheRedMudder commented 6 years ago

In my testing, I found that the GPU build without AVX2 on the CPU outperformed the GPU build with AVX2! I thought AVX2 would give performance gains for operations that only have CPU implementations, but in my test AVX2 marginally decreased performance when the GPU flag was used. Of course, AVX2 did improve performance when no GPU flag was used. Have you seen similar results?

Benchmark Test

| | AVX2 | No AVX2 |
| --- | --- | --- |
| No GPU | 3rd place | 4th place |
| GPU | 2nd place | 1st place |

Test Specifications

The benchmarks were run on a Windows 10 system with an Intel i5-8400, 16 GB RAM, and an NVIDIA GeForce GTX 1080 Ti with 11 GB of dedicated memory. The code and the video being processed were the same in every run; only the AVX2 and GPU support in the TensorFlow build differed. YOLOv2 with TensorFlow as the backend was used for benchmarking, with the following command:

python flow --model cfg/yolo.cfg --load bin/yolo.weights --demo ../video/Ron.mp4 --gpu .8 --saveVideo

Questions

Also, thanks @fo40225 for this repo, it is an awesome time saver!

fo40225 commented 6 years ago

@TheRedMudder I saw that the "GPU + AVX2" case contains an "out of vram" error. You should recheck your benchmark script.

If you can add my sse2 version to the benchmark, it will make the result clearer (sse2 vs avx2, official vs custom build).

GuyTraveler commented 6 years ago

Any updates regarding this issue? I was testing the inference speed difference between the multiple optimized binaries for Windows and at best noticed a 20 ms improvement, which certainly does not measure up to expectations.
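An absolute figure like 20 ms is hard to judge without the baseline it is measured against. The sketch below shows the arithmetic for turning two timings into a relative improvement; the function name and the baseline values in the example are hypothetical and are not measurements from this thread.

```python
def relative_improvement(baseline_ms, optimized_ms):
    """Return the fraction of the baseline time that was saved (0.1 means 10%)."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (baseline_ms - optimized_ms) / baseline_ms


if __name__ == "__main__":
    # Hypothetical baselines: the same 20 ms saving against two different run times.
    for baseline in (100.0, 2000.0):
        gain = relative_improvement(baseline, baseline - 20.0)
        print(f"baseline {baseline:.0f} ms: {gain:.1%} faster")
```

With a short baseline the same 20 ms is a large share of the run time, while with a long one it is close to measurement noise, so reporting both timings makes results easier to compare across builds.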