fengggli / gpu-computing-materials

A simple deep learning framework that optimizes task scheduling and memory usage on different CPU/GPU architectures.

Adding NUMA-aware work threads #54

Closed: fengggli closed this issue 4 years ago.

fengggli commented 5 years ago

Add work threads, so that:

  1. each work thread has its own copy of the weights;
  2. each work thread gets a subset of each mini-batch per iteration and iterates over all the images in that subset (forward and backward), accumulating gradients locally;
  3. gradients are then accumulated across all work threads (all-reduce), after which each work thread updates its own weights;
  4. for simplicity, each work thread calls single-threaded BLAS.

Step 1: Clean up tensor allocations, since all the weights (i.e. the model) need to be duplicated in each work thread (every work thread will call resnet_init, resnet_loss, and resnet_finalize separately).
Step 2: Test work threads with conv layers (same as above).
Step 3: Test work threads with resnet.

Note that communication between work threads happens only during the all-reduce phase, so that is where the effect of NUMA will show up; a minimal sketch of the per-worker loop is given below.

For machine configurations, see https://github.com/fengggli/gpu-computing-materials/issues/58.
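A minimal sketch of the scheme above, assuming pthreads and entirely hypothetical names (this is not the actual awnn API): each worker owns a private weights/gradients replica, runs forward/backward on its slice, and a naive all-reduce sums the gradients before every local update.

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>
    #include <stdio.h>

    #define NR_WORKERS 4
    #define PARAM_SIZE 1024

    typedef struct {
      int id;
      double weights[PARAM_SIZE]; /* private replica of the model */
      double grads[PARAM_SIZE];   /* locally accumulated gradients */
    } worker_t;

    static worker_t workers[NR_WORKERS];
    static pthread_barrier_t barrier;

    /* Stand-in for forward+backward over this worker's slice of the
       mini-batch; the real code would call resnet_loss etc. */
    static void forward_backward(worker_t *w) {
      for (int i = 0; i < PARAM_SIZE; i++)
        w->grads[i] += 0.01 * (w->id + 1); /* fake gradient */
    }

    /* Naive all-reduce: worker 0 sums everyone's gradients serially,
       then the others copy the result back; barriers separate phases. */
    static void allreduce_grads(worker_t *w) {
      pthread_barrier_wait(&barrier); /* all local grads ready */
      if (w->id == 0)
        for (int t = 1; t < NR_WORKERS; t++)
          for (int i = 0; i < PARAM_SIZE; i++)
            workers[0].grads[i] += workers[t].grads[i];
      pthread_barrier_wait(&barrier); /* sum complete */
      if (w->id != 0)
        for (int i = 0; i < PARAM_SIZE; i++)
          w->grads[i] = workers[0].grads[i];
      pthread_barrier_wait(&barrier); /* copies complete */
    }

    static void *worker_main(void *arg) {
      worker_t *w = (worker_t *)arg;
      for (int iter = 0; iter < 10; iter++) {
        forward_backward(w);
        allreduce_grads(w);
        for (int i = 0; i < PARAM_SIZE; i++) { /* local SGD update */
          w->weights[i] -= 0.1 * w->grads[i];
          w->grads[i] = 0.0;
        }
      }
      return NULL;
    }

    int main(void) {
      pthread_t tid[NR_WORKERS];
      pthread_barrier_init(&barrier, NULL, NR_WORKERS);
      for (int t = 0; t < NR_WORKERS; t++) {
        workers[t].id = t;
        pthread_create(&tid[t], NULL, worker_main, &workers[t]);
      }
      for (int t = 0; t < NR_WORKERS; t++)
        pthread_join(tid[t], NULL);
      pthread_barrier_destroy(&barrier);
      printf("weights[0] of worker 0: %f\n", workers[0].weights[0]);
      return 0;
    }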

fengggli commented 5 years ago

Without all-reduce

Batch size 128, using different numbers of threads (commit 50f9ae7). Command: ./bench/bench-net-resnet 128 8, built with -DUSE_SEQUENTIAL_BLAS=on or off.

fengggli commented 5 years ago

All-reduce

Naive all-reduce

Possible improvements:

  1. I could create a dummy model holding all the gradients (before launching the work threads) to store the accumulated results.
  2. I could use a merge-tree structure, like MPI_Allreduce does; see the sketch after this list.
  3. Need to look into how NUMA-Caffe did this efficiently:
    colordiff -r Intel_Caffe/ NUMA_Caffe/ | grep 'diff -r' | awk 'BEGIN{FS="."} {print $NF}' | sort | uniq

    which lists the extensions of the files that differ:

    cpp
    example
    hpp
    proto
    prototxt
    sh

    To view the full diff:

    colordiff -x "*.example" -x "*out_*" -r Intel_Caffe/ NUMA_Caffe/ | less
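As a sketch of item 2 above (same assumptions as the earlier sketch: pthreads, hypothetical names, not the awnn code), a binomial merge-tree in the style of MPI_Allreduce sums the gradients in O(log n) parallel steps instead of letting worker 0 walk all n replicas serially:

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>
    #include <stdio.h>

    #define NR_WORKERS 8 /* a power of two keeps the tree simple */
    #define PARAM_SIZE 4

    static double grads[NR_WORKERS][PARAM_SIZE];
    static pthread_barrier_t barrier;

    static void *tree_allreduce(void *arg) {
      int id = (int)(long)arg;
      for (int i = 0; i < PARAM_SIZE; i++)
        grads[id][i] = id + 1; /* fake local gradient */

      /* Reduce phase: pairwise merges with doubling stride; the barrier
         guarantees level `step` is complete before level 2*step reads it. */
      for (int step = 1; step < NR_WORKERS; step *= 2) {
        pthread_barrier_wait(&barrier);
        if (id % (2 * step) == 0 && id + step < NR_WORKERS)
          for (int i = 0; i < PARAM_SIZE; i++)
            grads[id][i] += grads[id + step][i];
      }

      /* Broadcast phase: every worker copies the total from worker 0. */
      pthread_barrier_wait(&barrier);
      if (id != 0)
        for (int i = 0; i < PARAM_SIZE; i++)
          grads[id][i] = grads[0][i];
      return NULL;
    }

    int main(void) {
      pthread_t tid[NR_WORKERS];
      pthread_barrier_init(&barrier, NULL, NR_WORKERS);
      for (long t = 0; t < NR_WORKERS; t++)
        pthread_create(&tid[t], NULL, tree_allreduce, (void *)t);
      for (int t = 0; t < NR_WORKERS; t++)
        pthread_join(tid[t], NULL);
      pthread_barrier_destroy(&barrier);
      printf("sum in every replica: %g (expect 36)\n", grads[7][0]);
      return 0;
    }
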
fengggli commented 5 years ago

I made several improvements to my code base (largely refactoring memory allocation and layer forward/backward, removing unnecessary memory copies, etc.):


Time breakdown for 16 threads: forward/backward 83.32 ms, all-reduce 32.24 ms, local gradient update 0.327 ms.

    (py36) lifen@sievert(:):~/Workspace/gpu-computing-materials/build_legacy$ ./bench/bench-net-resnet 128 16
    Opening Training data
    Opening Testing data
    worker0, Iter=0, Loss 2.60
    worker0, Iter=1, Loss 2.55
    worker0, Iter=2, Loss 2.48
    worker0, Iter=3, Loss 2.42
    worker0, Iter=4, Loss 2.39
    worker0, Iter=5, Loss 2.37
    worker0, Iter=6, Loss 2.34
    worker0, Iter=7, Loss 2.32
    worker0, Iter=8, Loss 2.31
    worker0, Iter=9, Loss 2.30
    [WRN]:[thread 0]: time-per-iteration (116.777 ms), forward-backward (91.423 ms), allreduce (24.933 ms), gradientupdate (0.420 ms)
    [WRN]:joined!
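
The per-phase timers in the log above could be kept with CLOCK_MONOTONIC samples around each phase; a minimal sketch (the phase functions are hypothetical placeholders, not the actual awnn instrumentation):

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    /* Wall-clock milliseconds from a monotonic clock. */
    static double now_ms(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
    }

    int main(void) {
      double t_fwbk = 0, t_reduce = 0, t_update = 0, t0;
      int iters = 10;

      for (int iter = 0; iter < iters; iter++) {
        t0 = now_ms();
        /* forward_backward(w); */
        t_fwbk += now_ms() - t0;

        t0 = now_ms();
        /* allreduce_grads(w); */
        t_reduce += now_ms() - t0;

        t0 = now_ms();
        /* sgd_update(w); */
        t_update += now_ms() - t0;
      }
      printf("[WRN]:[thread 0]: forward-backward (%.3f ms), allreduce (%.3f ms), gradientupdate (%.3f ms)\n",
             t_fwbk / iters, t_reduce / iters, t_update / iters);
      return 0;
    }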

See https://github.com/fengggli/gpu-computing-materials/pull/56.

fengggli commented 5 years ago

BVLC Caffe with MKL: 1137 ms forward/backward time.

fengggli commented 5 years ago

Updated results with -O3.

Build options:

    • awnn: cmake -DUSE_MKL=on -DAWNN_USE_FLT32=on -DCMAKE_BUILD_TYPE=Release ..
    • intel caffe: cmake -DBLAS=mkl -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=on ..
    • bvlc_caffe: cmake -DBLAS=mkl -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=on ..

Average forward/backward time


Normalized speedup
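
(Assuming "normalized speedup" here means the single-thread time divided by the n-thread time, speedup(n) = T(1)/T(n), so one thread scores exactly 1.)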


fengggli commented 5 years ago

Experiments on Stampede2

  1. I tried to build Caffe myself, but some of the dependencies (glog, protobuf) are difficult to deal with; it can get really complex: http://www.andrewjanowczyk.com/installing-caffe-on-the-ohio-super-computing-osc-ruby-cluster/
  2. Then I used the prebuilt Caffe on Stampede2 (https://portal.tacc.utexas.edu/software/caffe) (note: -engine="MKL2017"; an example invocation is below).
  3. For awnn, I used -O3.
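
For reference, Intel Caffe's built-in benchmark can be driven like the following; the model path and iteration count here are placeholders, not the exact command used:

    caffe time -model models/resnet/train_val.prototxt -iterations 50 -engine "MKL2017"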


This doesn't look good. I should either:

  1. profile my code (using VTune, as in https://software.intel.com/en-us/articles/comparison-between-intel-optimized-caffe-and-vanilla-caffe-by-intel-vtune-amplifier), or
  2. continue trying to build Caffe myself.

I have also asked TACC for the build script of intel-caffe.

fengggli commented 5 years ago

On gibson

  1. Try to build intel-caffe on gibson!
  2. See https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance

On Stampede2

In makefile.config.stampede, the relevant options are:

1. USE_MLSL=1
2. INCLUDE_DIRS, LIBRARY_DIRS
3. ALLOW_LMDB_NOLOCK := 1
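
The corresponding lines in makefile.config.stampede presumably look like this (the include/library paths are placeholders, not the exact ones used):

    USE_MLSL := 1
    ALLOW_LMDB_NOLOCK := 1
    INCLUDE_DIRS := $(HOME)/software/install/include /usr/local/include
    LIBRARY_DIRS := $(HOME)/software/install/lib /usr/local/lib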

In makefile.stampede, the diff against the stock makefile:

    105c105
    < DYNAMIC_VERSION_REVISION      := 4
    ---
    > DYNAMIC_VERSION_REVISION      := 3
    454c454
    <       COMMON_FLAGS += -DNDEBUG -O3 -xHost -no-prec-div -fp-model fast=2
    ---
    >       COMMON_FLAGS += -DNDEBUG -O3 -xHost -xCOMMON-AVX512 -no-prec-div -fp-model fast=2

I managed to build successfully on Stampede2 with the makefile provided by TACC, after:

  1. using my own protobuf build, installed in ~/software/install;
  2. module load boost/1.65;
  3. writing a module file to prepend those paths (https://github.com/fengggli/configurations/commit/9186f0a1b550558bf8f7b4f19623dad068bbf171), so that module load caffe_deps loads all of Caffe's dependencies; a sketch of such a modulefile is below. Using https://github.com/fengggli/caffe/commit/e4b769936da8c208abbaeb3de073ca38c2dece40 achieves a 133 ms single-thread time, matching the TACC-moduled Caffe.
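
A minimal sketch of such a caffe_deps modulefile (environment-modules Tcl syntax; the paths are assumed to match ~/software/install):

    #%Module1.0
    ## caffe_deps: prepend locally-built dependencies for intel-caffe
    prepend-path PATH            $env(HOME)/software/install/bin
    prepend-path LD_LIBRARY_PATH $env(HOME)/software/install/lib
    prepend-path CPATH           $env(HOME)/software/install/include
    module load boost/1.65
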
fengggli commented 5 years ago

Experiments with intel-caffe on Stampede2.

(Previously I used the CMake build, which does not append any Intel-specific compile flags.)

Experiments on sievert

Guess: the Intel code is written in a way that AVX-512 can boost its performance? One way to check is sketched below.
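
One way to test this guess would be to count AVX-512 (zmm-register) instructions in the disassembly of the two binaries (the binary paths here are illustrative):

    objdump -d $(which caffe) | grep -c zmm
    objdump -d ./bench/bench-net-resnet | grep -c zmm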