fengggli / gpu-computing-materials

A simple deep learning framework that optimizes task scheduling and memory usage on different CPU/GPU architectures.

perf tuning #57

Closed fengggli closed 5 years ago

fengggli commented 5 years ago

Machine configuration: see https://github.com/fengggli/gpu-computing-materials/issues/58. Note: the default gcc 4.8 doesn't support AVX-512, so I built gcc 7.3 and am compiling with -march=skylake-avx512.
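A quick way to confirm that a build actually picked up the flag is to test the `__AVX512F__` macro, which GCC predefines when AVX-512 Foundation is enabled (for example via -march=skylake-avx512). This is a minimal sketch; the function names are hypothetical and not part of this repository.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if this translation unit was compiled
 * with AVX-512 Foundation enabled, 0 otherwise. It reports what the
 * compiler was allowed to emit, not what the CPU running it supports. */
int built_with_avx512f(void)
{
#if defined(__AVX512F__)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: human-readable form of the check above. */
const char *avx512_build_status(void)
{
    return built_with_avx512f() ? "AVX-512F enabled at compile time"
                                : "AVX-512F not enabled at compile time";
}
```

Printing `avx512_build_status()` at startup makes it obvious when a run accidentally used a build without the flag.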

Testing resnet8 forward/backward time at batch size 256, for 10 iterations:

  1. baseline (gnu + -O3 + mkl-sequential): 500 ms
  2. with gnu avx512: 540 ms
  3. with icc avx512: 538 ms
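For reference, numbers like the ones above can be collected with a small wall-clock harness. The sketch below assumes a POSIX system with `clock_gettime`; `time_iters_ms` and `count_step` are hypothetical names, and `count_step` only stands in for a real forward/backward pass. The thread does not say whether the reported figures are totals or per-iteration averages, so the function returns the total and leaves the division to the caller.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <time.h>

/* Current time in milliseconds from a monotonic clock. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Runs step(ctx) `iters` times and returns the total elapsed wall-clock
 * time in milliseconds. Divide by `iters` for a per-iteration average. */
double time_iters_ms(void (*step)(void *), void *ctx, int iters)
{
    assert(step != NULL && iters > 0);
    const double t0 = now_ms();
    for (int i = 0; i < iters; ++i)
        step(ctx);
    return now_ms() - t0;
}

/* Hypothetical placeholder workload: counts how often it was called.
 * A real measurement would run one forward/backward pass here. */
void count_step(void *ctx)
{
    ++*(int *)ctx;
}
```

A monotonic wall clock is used rather than `clock()` because `clock()` reports CPU time summed over threads, which would distort the multi-threaded runs.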

Notes about intel-caffe

refs

fengggli commented 5 years ago

intel-caffe-thrd-1 (intelcaffe/r001hs) image

image

fengggli commented 5 years ago

intel-caffe-thrd-4 (intelcaffe/r002hs) image

image

fengggli commented 5 years ago

awnn-single-thread (awnn/r008hs) image image

topdown tree image

fengggli commented 5 years ago

awnn-4-threads (awnn/r009hs)

summary

image

image image

fengggli commented 5 years ago

After applying a few optimizations (3991d3d, 5b3e389) suggested by the Intel guide, single-thread time dropped from 540 ms to 380 ms.

Now I only need to vectorize im2col and col2im using AVX-512:

image

I could try, but I'm not sure how much benefit there is in continuing to vectorize the code... (what's next?)
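As a starting point for the im2col question, here is a scalar reference whose innermost loop writes one contiguous output row. That is the shape a compiler run with -O3 and -march=skylake-avx512 has the best chance of auto-vectorizing before any hand-written intrinsics are attempted. This is only a sketch under assumptions: single image, CHW layout, square kernel, and a hypothetical function name; it is not the repository's actual im2col.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference im2col for one image stored as C x H x W.
 * Output `cols` has C*K*K rows and out_h*out_w columns, row-major.
 * Row (c, kh, kw) holds, for every output position (oh, ow), the input
 * value at (c, oh*stride + kh - pad, ow*stride + kw - pad), or 0 when
 * that position falls in the zero padding. */
void im2col_ref(const float *in, int C, int H, int W,
                int K, int stride, int pad, float *cols)
{
    assert(in != NULL && cols != NULL);
    assert(K > 0 && stride > 0 && pad >= 0);
    const int out_h = (H + 2 * pad - K) / stride + 1;
    const int out_w = (W + 2 * pad - K) / stride + 1;

    for (int c = 0; c < C; ++c) {
        for (int kh = 0; kh < K; ++kh) {
            for (int kw = 0; kw < K; ++kw) {
                float *row = cols + (size_t)((c * K + kh) * K + kw) * out_h * out_w;
                for (int oh = 0; oh < out_h; ++oh) {
                    const int ih = oh * stride + kh - pad;
                    float *dst = row + (size_t)oh * out_w;
                    if (ih < 0 || ih >= H) {
                        /* Whole output row comes from vertical padding. */
                        for (int ow = 0; ow < out_w; ++ow)
                            dst[ow] = 0.0f;
                        continue;
                    }
                    const float *src = in + ((size_t)c * H + ih) * W;
                    /* Contiguous writes; reads are contiguous when stride == 1. */
                    for (int ow = 0; ow < out_w; ++ow) {
                        const int iw = ow * stride + kw - pad;
                        dst[ow] = (iw >= 0 && iw < W) ? src[iw] : 0.0f;
                    }
                }
            }
        }
    }
}
```

Timing this version against the current one first would show whether the remaining gap justifies hand-written AVX-512 intrinsics at all, since im2col is mostly memory movement.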