batch size 128, using different numbers of threads (50f9ae7)
command: ./bench/bench-net-resnet 128 8
-DUSE_SEQUENTIAL_BLAS=on or off
cmake -DUSE_MKL=on -DAWNN_USE_FLT32=on .. ; all figures on this page are generated using flt32.
Currently I am using thread 0 to accumulate all the gradients, compute the average, and then each thread gets a copy (b7ca1e8ee4803fe0438e2225b8367a200df17c54). The difference between the two bars is the time for all_reduce communication (plus the synchronization time).
Currently it doesn't scale well at 16 or more threads; the reason might be the multiple pthread_barrier calls I use in the all_reduce.
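For reference, a minimal sketch of this style of all_reduce (thread 0 accumulates, averages, then the other threads copy the result back), synchronized with pthread barriers. The names (allreduce_grad, grad_bufs, NR_THREADS, GRAD_LEN) and the flat float buffers are illustrative only, not the actual awnn code:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_THREADS 4
#define GRAD_LEN 8

static pthread_barrier_t barrier;
static float *grad_bufs[NR_THREADS]; /* each worker's local gradient */

/* Thread-0-centric all_reduce: accumulate, average, then broadcast. */
static void allreduce_grad(int tid) {
  /* wait until every worker has finished its backward pass */
  pthread_barrier_wait(&barrier);

  if (tid == 0) {
    for (int t = 1; t < NR_THREADS; t++)
      for (int i = 0; i < GRAD_LEN; i++)
        grad_bufs[0][i] += grad_bufs[t][i];
    for (int i = 0; i < GRAD_LEN; i++)
      grad_bufs[0][i] /= NR_THREADS;
  }

  /* wait for thread 0 to finish the reduction before copying */
  pthread_barrier_wait(&barrier);

  if (tid != 0)
    memcpy(grad_bufs[tid], grad_bufs[0], GRAD_LEN * sizeof(float));

  /* keep grad_bufs[0] stable until every copy has completed */
  pthread_barrier_wait(&barrier);
}

static void *worker(void *arg) {
  int tid = (int)(long)arg;
  /* fake "backward pass": local gradient is just tid+1 everywhere */
  for (int i = 0; i < GRAD_LEN; i++)
    grad_bufs[tid][i] = (float)(tid + 1);
  allreduce_grad(tid);
  if (tid == 0)
    printf("averaged grad[0] = %.2f\n", grad_bufs[0][0]);
  return NULL;
}

int main(void) {
  pthread_t threads[NR_THREADS];
  pthread_barrier_init(&barrier, NULL, NR_THREADS);
  for (int t = 0; t < NR_THREADS; t++)
    grad_bufs[t] = malloc(GRAD_LEN * sizeof(float));
  for (int t = 0; t < NR_THREADS; t++)
    pthread_create(&threads[t], NULL, worker, (void *)(long)t);
  for (int t = 0; t < NR_THREADS; t++)
    pthread_join(threads[t], NULL);
  for (int t = 0; t < NR_THREADS; t++)
    free(grad_bufs[t]);
  pthread_barrier_destroy(&barrier);
  return 0;
}
```

Each all_reduce in this pattern costs several barrier waits, which is the synchronization overhead mentioned above and a plausible reason scaling degrades past 16 threads.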
colordiff -r Intel_Caffe/ NUMA_Caffe/|less|grep 'diff -r' | awk 'BEGIN{FS="."} {print $NF}'|sort|uniq
cpp
example
hpp
proto
prototxt
sh
To show the full diff:
colordiff -x "*.example" -x "*out_*" -r Intel_Caffe/ NUMA_Caffe/|less
I made several improvements to my code base (largely refactored memory allocation and layer forward/backward, removed unnecessary memory copies, etc.):
cmake -DUSE_MKL=on -DAWNN_USE_FLT32=on ..
Built with the above configuration and run with 16 threads. Time breakdown: forward/backward 83.32 ms, allreduce 32.24 ms, update local gradient 0.327 ms.
(py36) lifen@sievert(:):~/Workspace/gpu-computing-materials/build_legacy$./bench/bench-net-resnet 128 16
Opening Training data
Opening Testing data
worker0, Iter=0, Loss 2.60
worker0, Iter=1, Loss 2.55
worker0, Iter=2, Loss 2.48
worker0, Iter=3, Loss 2.42
worker0, Iter=4, Loss 2.39
worker0, Iter=5, Loss 2.37
worker0, Iter=6, Loss 2.34
worker0, Iter=7, Loss 2.32
worker0, Iter=8, Loss 2.31
worker0, Iter=9, Loss 2.30
[WRN]:[thread 0]: time-per-iteration (116.777 ms), forward-backward (91.423 ms), allreduce (24.933 ms), gradientupdate (0.420 ms)
[WRN]:joined!
See https://github.com/fengggli/gpu-computing-materials/pull/56
BVLC Caffe with MKL: 1137 forward/backward time
cmake -DUSE_MKL=on -DAWNN_USE_FLT32=on -DCMAKE_BUILD_TYPE=Release ..
cmake -DBLAS=mkl -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=on ..
This doesn't look good; I shall either:
In makefile.config.stampede, some notable options:
1. USE_MLSL=1
2. INCLUDE_DIRS, LIBRARY_DIRS
3. ALLOW_LMDB_NOLOCK := 1
in makefile.stampede:
105c105
< DYNAMIC_VERSION_REVISION := 4
---
> DYNAMIC_VERSION_REVISION := 3
454c454
< COMMON_FLAGS += -DNDEBUG -O3 -xHost -no-prec-div -fp-model fast=2
---
> COMMON_FLAGS += -DNDEBUG -O3 -xHost -xCOMMON-AVX512 -no-prec-div -fp-model fast=2
I managed to build successfully on Stampede2 with the makefile provided by TACC, after:
*Previously I used cmake, which doesn't append any Intel-specific compile flags.
CC=icc CXX=icpc USE_HDF5=0 make
Add worker threads, so that:
Step 1: Clean up tensor allocations, since all weights (models) need to be duplicated in each worker thread (each worker thread will call resnet_init, resnet_loss, and resnet_finalize separately).
Step 2: Test worker threads with conv layers (same as above).
Step 3: Test worker threads with resnet (a rough sketch of the per-thread lifecycle is below).
Note that communication between worker threads only happens during the all-reduce phase. The effect of NUMA will be on:
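A rough sketch of the per-worker-thread lifecycle planned above. The stubs for resnet_init, resnet_loss, resnet_finalize, and allreduce_grad are placeholders (the real signatures live in the awnn code and differ); only the call order mirrors the plan:

```c
#include <pthread.h>
#include <stdio.h>

#define NR_ITERS 10
#define NR_WORKERS 4

/* Placeholder per-thread model handle and stubs for the real awnn calls. */
typedef struct { int tid; } model_t;

static void  resnet_init(model_t *m, int tid) { m->tid = tid; } /* duplicate weights per thread */
static float resnet_loss(model_t *m) { (void)m; return 0.0f; }  /* stand-in for forward+backward */
static void  resnet_finalize(model_t *m) { (void)m; }           /* free per-thread tensors */
static void  allreduce_grad(int tid) { (void)tid; }             /* the only cross-thread communication */

static void *worker(void *arg) {
  int tid = (int)(long)arg;
  model_t model;
  resnet_init(&model, tid);                /* Step 1: every worker owns its own copy of the weights */
  for (int iter = 0; iter < NR_ITERS; iter++) {
    float loss = resnet_loss(&model);      /* Steps 2-3: forward/backward on the local mini-batch shard */
    allreduce_grad(tid);                   /* workers only talk to each other here */
    if (tid == 0)
      printf("worker0, Iter=%d, Loss %.2f\n", iter, loss);
  }
  resnet_finalize(&model);
  return NULL;
}

int main(void) {
  pthread_t threads[NR_WORKERS];
  for (int t = 0; t < NR_WORKERS; t++)
    pthread_create(&threads[t], NULL, worker, (void *)(long)t);
  for (int t = 0; t < NR_WORKERS; t++)
    pthread_join(threads[t], NULL);
  return 0;
}
```

Since each worker holds its own weights and tensors, NUMA placement of those per-thread allocations matters mainly for the forward/backward phase, while the all-reduce is the only point where gradients cross sockets.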
For machine configurations, see https://github.com/fengggli/gpu-computing-materials/issues/58