fengggli / gpu-computing-materials

A simple deep learning framework that optimizes task scheduling and memory usage on different CPU/GPU architectures.

Adding NUMA-aware work threads #54

Closed: fengggli closed this issue 4 years ago.

fengggli commented 5 years ago

Add work threads, so that:

  1. each work thread has its own copy of the weights;
  2. each work thread gets a subset of each mini-batch per iteration and iterates over all the images in that subset (forward and backward), accumulating gradients locally;
  3. gradients are then accumulated across all work threads (all-reduce), after which each work thread updates its own weights;
  4. for simplicity, each work thread calls single-threaded BLAS.

Step 1: Clean up tensor allocations, since all the weights (i.e. the model) need to be duplicated in each work thread (every work thread will call resnet_init, resnet_loss, and resnet_finalize separately).
Step 2: Test work threads with conv layers (same as above).
Step 3: Test work threads with resnet.

Note that communication between work threads happens only during the all-reduce phase, so that is where the effect of NUMA will show up; a minimal sketch of the per-worker loop is given below.

For machine configurations, see https://github.com/fengggli/gpu-computing-materials/issues/58.
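A minimal sketch of the scheme above, assuming pthreads and entirely hypothetical names (this is not the actual awnn API): each worker owns a private weights/gradients replica, runs forward/backward on its slice, and a naive all-reduce sums the gradients before every local update.

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>
    #include <stdio.h>

    #define NR_WORKERS 4
    #define PARAM_SIZE 1024

    typedef struct {
      int id;
      double weights[PARAM_SIZE]; /* private replica of the model */
      double grads[PARAM_SIZE];   /* locally accumulated gradients */
    } worker_t;

    static worker_t workers[NR_WORKERS];
    static pthread_barrier_t barrier;

    /* Stand-in for forward+backward over this worker's slice of the
       mini-batch; the real code would call resnet_loss etc. */
    static void forward_backward(worker_t *w) {
      for (int i = 0; i < PARAM_SIZE; i++)
        w->grads[i] += 0.01 * (w->id + 1); /* fake gradient */
    }

    /* Naive all-reduce: worker 0 sums everyone's gradients serially,
       then the others copy the result back; barriers separate phases. */
    static void allreduce_grads(worker_t *w) {
      pthread_barrier_wait(&barrier); /* all local grads ready */
      if (w->id == 0)
        for (int t = 1; t < NR_WORKERS; t++)
          for (int i = 0; i < PARAM_SIZE; i++)
            workers[0].grads[i] += workers[t].grads[i];
      pthread_barrier_wait(&barrier); /* sum complete */
      if (w->id != 0)
        for (int i = 0; i < PARAM_SIZE; i++)
          w->grads[i] = workers[0].grads[i];
      pthread_barrier_wait(&barrier); /* copies complete */
    }

    static void *worker_main(void *arg) {
      worker_t *w = (worker_t *)arg;
      for (int iter = 0; iter < 10; iter++) {
        forward_backward(w);
        allreduce_grads(w);
        for (int i = 0; i < PARAM_SIZE; i++) { /* local SGD update */
          w->weights[i] -= 0.1 * w->grads[i];
          w->grads[i] = 0.0;
        }
      }
      return NULL;
    }

    int main(void) {
      pthread_t tid[NR_WORKERS];
      pthread_barrier_init(&barrier, NULL, NR_WORKERS);
      for (int t = 0; t < NR_WORKERS; t++) {
        workers[t].id = t;
        pthread_create(&tid[t], NULL, worker_main, &workers[t]);
      }
      for (int t = 0; t < NR_WORKERS; t++)
        pthread_join(tid[t], NULL);
      pthread_barrier_destroy(&barrier);
      printf("weights[0] of worker 0: %f\n", workers[0].weights[0]);
      return 0;
    }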

fengggli commented 5 years ago

Without all-reduce

Batch size 128, using different numbers of threads (commit 50f9ae7). Command: ./bench/bench-net-resnet 128 8, built with -DUSE_SEQUENTIAL_BLAS=on or off.

fengggli commented 5 years ago

All-reduce

Naive all-reduce

Possible improvements:

  1. I could create a dummy model holding all the gradients (before launching the work threads) to store the accumulated results.
  2. I could use a merge-tree structure, like MPI_Allreduce does; see the sketch after this list.
  3. Need to look into how NUMA-Caffe did this efficiently:
    colordiff -r Intel_Caffe/ NUMA_Caffe/ | grep 'diff -r' | awk 'BEGIN{FS="."} {print $NF}' | sort | uniq

    which lists the extensions of the files that differ:

    cpp
    example
    hpp
    proto
    prototxt
    sh

    To view the full diff:

    colordiff -x "*.example" -x "*out_*" -r Intel_Caffe/ NUMA_Caffe/ | less
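As a sketch of item 2 above (same assumptions as the earlier sketch: pthreads, hypothetical names, not the awnn code), a binomial merge-tree in the style of MPI_Allreduce sums the gradients in O(log n) parallel steps instead of letting worker 0 walk all n replicas serially:

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>
    #include <stdio.h>

    #define NR_WORKERS 8 /* a power of two keeps the tree simple */
    #define PARAM_SIZE 4

    static double grads[NR_WORKERS][PARAM_SIZE];
    static pthread_barrier_t barrier;

    static void *tree_allreduce(void *arg) {
      int id = (int)(long)arg;
      for (int i = 0; i < PARAM_SIZE; i++)
        grads[id][i] = id + 1; /* fake local gradient */

      /* Reduce phase: pairwise merges with doubling stride; the barrier
         guarantees level `step` is complete before level 2*step reads it. */
      for (int step = 1; step < NR_WORKERS; step *= 2) {
        pthread_barrier_wait(&barrier);
        if (id % (2 * step) == 0 && id + step < NR_WORKERS)
          for (int i = 0; i < PARAM_SIZE; i++)
            grads[id][i] += grads[id + step][i];
      }

      /* Broadcast phase: every worker copies the total from worker 0. */
      pthread_barrier_wait(&barrier);
      if (id != 0)
        for (int i = 0; i < PARAM_SIZE; i++)
          grads[id][i] = grads[0][i];
      return NULL;
    }

    int main(void) {
      pthread_t tid[NR_WORKERS];
      pthread_barrier_init(&barrier, NULL, NR_WORKERS);
      for (long t = 0; t < NR_WORKERS; t++)
        pthread_create(&tid[t], NULL, tree_allreduce, (void *)t);
      for (int t = 0; t < NR_WORKERS; t++)
        pthread_join(tid[t], NULL);
      pthread_barrier_destroy(&barrier);
      printf("sum in every replica: %g (expect 36)\n", grads[7][0]);
      return 0;
    }
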
fengggli commented 5 years ago

I made several improvements to my code base (largely refactoring memory allocation and layer forward/backward, removing unnecessary memory copies, etc.):


Time breakdown for 16 threads: forward/backward 83.32 ms, all-reduce 32.24 ms, local gradient update 0.327 ms.

    (py36) lifen@sievert(:):~/Workspace/gpu-computing-materials/build_legacy$ ./bench/bench-net-resnet 128 16
    Opening Training data
    Opening Testing data
    worker0, Iter=0, Loss 2.60
    worker0, Iter=1, Loss 2.55
    worker0, Iter=2, Loss 2.48
    worker0, Iter=3, Loss 2.42
    worker0, Iter=4, Loss 2.39
    worker0, Iter=5, Loss 2.37
    worker0, Iter=6, Loss 2.34
    worker0, Iter=7, Loss 2.32
    worker0, Iter=8, Loss 2.31
    worker0, Iter=9, Loss 2.30
    [WRN]:[thread 0]: time-per-iteration (116.777 ms), forward-backward (91.423 ms), allreduce (24.933 ms), gradientupdate (0.420 ms)
    [WRN]:joined!
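
The per-phase timers in the log above could be kept with CLOCK_MONOTONIC samples around each phase; a minimal sketch (the phase functions are hypothetical placeholders, not the actual awnn instrumentation):

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    /* Wall-clock milliseconds from a monotonic clock. */
    static double now_ms(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
    }

    int main(void) {
      double t_fwbk = 0, t_reduce = 0, t_update = 0, t0;
      int iters = 10;

      for (int iter = 0; iter < iters; iter++) {
        t0 = now_ms();
        /* forward_backward(w); */
        t_fwbk += now_ms() - t0;

        t0 = now_ms();
        /* allreduce_grads(w); */
        t_reduce += now_ms() - t0;

        t0 = now_ms();
        /* sgd_update(w); */
        t_update += now_ms() - t0;
      }
      printf("[WRN]:[thread 0]: forward-backward (%.3f ms), allreduce (%.3f ms), gradientupdate (%.3f ms)\n",
             t_fwbk / iters, t_reduce / iters, t_update / iters);
      return 0;
    }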

See https://github.com/fengggli/gpu-computing-materials/pull/56.

fengggli commented 5 years ago

BVLC Caffe with MKL: 1137 ms forward/backward time.

fengggli commented 5 years ago

Updated results with -O3.

Build options:

    • awnn: cmake -DUSE_MKL=on -DAWNN_USE_FLT32=on -DCMAKE_BUILD_TYPE=Release ..
    • intel caffe: cmake -DBLAS=mkl -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=on ..
    • bvlc_caffe: cmake -DBLAS=mkl -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=on ..

Average forward/backward time


Normalized speedup
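
(Assuming "normalized speedup" here means the single-thread time divided by the n-thread time, speedup(n) = T(1)/T(n), so one thread scores exactly 1.)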


fengggli commented 5 years ago

Experiments on Stampede2

  1. I tried to build Caffe myself, but some of the dependencies (glog, protobuf) are difficult to deal with; it can get really complex: http://www.andrewjanowczyk.com/installing-caffe-on-the-ohio-super-computing-osc-ruby-cluster/
  2. Then I used the prebuilt Caffe on Stampede2 (https://portal.tacc.utexas.edu/software/caffe) (note: -engine="MKL2017"; an example invocation is below).
  3. For awnn, I used -O3.
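
For reference, Intel Caffe's built-in benchmark can be driven like the following; the model path and iteration count here are placeholders, not the exact command used:

    caffe time -model models/resnet/train_val.prototxt -iterations 50 -engine "MKL2017"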


This doesn't look good. I should either:

  1. profile my code (using VTune, as in https://software.intel.com/en-us/articles/comparison-between-intel-optimized-caffe-and-vanilla-caffe-by-intel-vtune-amplifier), or
  2. continue trying to build Caffe myself.

I have also asked TACC for the build script of intel-caffe.

fengggli commented 5 years ago

On gibson

  1. Try to build intel-caffe on gibson!
  2. See https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance

On Stampede2

In makefile.config.stampede, the relevant options are:

1. USE_MLSL=1
2. INCLUDE_DIRS, LIBRARY_DIRS
3. ALLOW_LMDB_NOLOCK := 1
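
The corresponding lines in makefile.config.stampede presumably look like this (the include/library paths are placeholders, not the exact ones used):

    USE_MLSL := 1
    ALLOW_LMDB_NOLOCK := 1
    INCLUDE_DIRS := $(HOME)/software/install/include /usr/local/include
    LIBRARY_DIRS := $(HOME)/software/install/lib /usr/local/lib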

In makefile.stampede, the diff against the stock makefile:

    105c105
    < DYNAMIC_VERSION_REVISION      := 4
    ---
    > DYNAMIC_VERSION_REVISION      := 3
    454c454
    <       COMMON_FLAGS += -DNDEBUG -O3 -xHost -no-prec-div -fp-model fast=2
    ---
    >       COMMON_FLAGS += -DNDEBUG -O3 -xHost -xCOMMON-AVX512 -no-prec-div -fp-model fast=2

I managed to build successfully on Stampede2 with the makefile provided by TACC, after:

  1. using my own protobuf build, installed in ~/software/install;
  2. module load boost/1.65;
  3. writing a module file to prepend those paths (https://github.com/fengggli/configurations/commit/9186f0a1b550558bf8f7b4f19623dad068bbf171), so that module load caffe_deps loads all of Caffe's dependencies; a sketch of such a modulefile is below. Using https://github.com/fengggli/caffe/commit/e4b769936da8c208abbaeb3de073ca38c2dece40 achieves a 133 ms single-thread time, matching the TACC-moduled Caffe.
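
A minimal sketch of such a caffe_deps modulefile (environment-modules Tcl syntax; the paths are assumed to match ~/software/install):

    #%Module1.0
    ## caffe_deps: prepend locally-built dependencies for intel-caffe
    prepend-path PATH            $env(HOME)/software/install/bin
    prepend-path LD_LIBRARY_PATH $env(HOME)/software/install/lib
    prepend-path CPATH           $env(HOME)/software/install/include
    module load boost/1.65
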
fengggli commented 5 years ago

Experiments with intel-caffe on Stampede2.

(Previously I used the CMake build, which does not append any Intel-specific compile flags.)

Experiments on sievert

Guess: the Intel code is written in a way that AVX-512 can boost its performance? One way to check is sketched below.
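
One way to test this guess would be to count AVX-512 (zmm-register) instructions in the disassembly of the two binaries (the binary paths here are illustrative):

    objdump -d $(which caffe) | grep -c zmm
    objdump -d ./bench/bench-net-resnet | grep -c zmm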