fengggli / gpu-computing-materials

A simple deep learning framework that optimizes task scheduling and memory usage on different CPU/GPU architectures.

hybrid parallelism and its strategy #60

Open fengggli opened 4 years ago

fengggli commented 4 years ago

Using a simplified vggnet (e222802a) as the testbed:

  1. resnet has limited communication.
  2. vggnet can be extended easily with different variations, with which we can test whether we can provide a better strategy.

TODO: compare with intel-caffe

fengggli commented 4 years ago

note: results are saved in sievert:~/intel/amplxe/projects; run with /opt/intel/vtune_amplifier/bin64/amplxe-gui. For intel-caffe, make sure to first source the Intel compilers and export the extern/mkl libraries in the intel-caffe source tree.

update

  1. dim_get_capacity and some related macros are now replaced with a more efficient implementation.
  2. Quite different speeds: might they be placed in different NUMA nodes? (No, this runs with 4 threads; the main thread gets roughly 1/4 of the CPU time.)

(screenshot)

run with 12 threads

(screenshot)

note: sgd update can be optimized easily https://software.intel.com/en-us/articles/comparison-between-intel-optimized-caffe-and-vanilla-caffe-by-intel-vtune-amplifier

baseline

  1. Without explicit control of memory allocation: (screenshot)
  2. Launch with different NUMA settings: (screenshot)
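Launching with different NUMA settings can also be done explicitly from inside the process. A minimal sketch, assuming a Linux/glibc environment: pin a thread to one CPU so that, under the first-touch policy, its allocations land on that CPU's NUMA node. `pin_to_cpu` is an illustrative name, not the framework's API; the shell-level equivalent is launching under `numactl --cpunodebind/--membind`.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one CPU. Under the first-touch policy,
 * memory the thread then allocates and initializes stays on that
 * CPU's NUMA node. Illustrative sketch, not the framework's API. */
static int pin_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  /* returns 0 on success, an errno-style code on failure */
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```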
fengggli commented 4 years ago

intel-caffe vgg-simple with 12 threads

(screenshot)

todo: let the data loader load only partial data
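A partial loader could partition the samples by worker rank so each thread reads only its shard instead of the full dataset. A minimal sketch; `shard_range` and its signature are hypothetical, not the framework's loader API:

```c
#include <stddef.h>

/* Worker `rank` loads only its shard [begin, end) of the n samples.
 * The first n % nworkers shards each get one extra sample, so the
 * split is as even as possible. Illustrative names only. */
static void shard_range(size_t n, int nworkers, int rank,
                        size_t *begin, size_t *end) {
  size_t base = n / (size_t)nworkers;
  size_t rem = n % (size_t)nworkers;
  size_t r = (size_t)rank;
  *begin = r * base + (r < rem ? r : rem);
  *end = *begin + base + (r < rem ? 1 : 0);
}
```

For example, 10 samples over 3 workers split into ranges [0,4), [4,7), [7,10).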

It turns out do_sgd_update_momentum and updage_regulizer_gradient are quite expensive in vggnet-hybrid-12: (screenshot)

Those operations can be optimized! (screenshot)
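One optimization, in the spirit of the Intel article linked above, is to fuse the two kernels: fold the L2-regularizer gradient into the momentum update so the weight, velocity, and gradient arrays each stream through memory once instead of twice. A hedged sketch with illustrative names (not the actual `do_sgd_update_momentum` / `updage_regulizer_gradient` signatures):

```c
#include <stddef.h>

/* Fused pass: the regularizer gradient (decay * w) is folded into
 * the momentum update, replacing two separate loops over the
 * parameters with one and roughly halving the memory traffic. */
static void sgd_momentum_l2_fused(float *w, float *v, const float *g,
                                  size_t n, float lr, float mu,
                                  float decay) {
  for (size_t i = 0; i < n; i++) {
    float gi = g[i] + decay * w[i]; /* regularizer folded into gradient */
    v[i] = mu * v[i] - lr * gi;     /* velocity update */
    w[i] += v[i];                   /* weight update, same pass */
  }
}
```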

fengggli commented 4 years ago

Add AVX vectorization.

  1. `./bench/bench-net-hybrid vggnet 128 12 10`

| setup | forward-backward | allreduce | gradient update |
|---|---|---|---|
| initial | 261.7 | 256.3 | 147.8 |

Vectorization doesn't help much here (-O3 and -march=native have already vectorized the simple loops); the time is spent in the all_reduce operations.
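For reference, "add AVX vectorization" for loops of this shape amounts to something like the sketch below (illustrative name `saxpy_avx`, assuming an AVX-capable x86 CPU). As noted, -O3 with -march=native already emits equivalent code for such simple loops, which is why the manual version gains little:

```c
#include <immintrin.h>
#include <stddef.h>

/* Hand-vectorized update w += alpha * g, eight floats per iteration.
 * Assumes the CPU supports AVX; the target attribute lets the
 * compiler emit AVX code for this function without global flags. */
__attribute__((target("avx")))
static void saxpy_avx(float *w, const float *g, size_t n, float alpha) {
  __m256 va = _mm256_set1_ps(alpha);
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 vw = _mm256_loadu_ps(w + i);
    __m256 vg = _mm256_loadu_ps(g + i);
    _mm256_storeu_ps(w + i, _mm256_add_ps(vw, _mm256_mul_ps(va, vg)));
  }
  for (; i < n; i++) /* scalar tail for n not divisible by 8 */
    w[i] += alpha * g[i];
}
```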

fengggli commented 4 years ago

strategy 1

Problem: a single lock makes the whole reduction a serial critical section.
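A minimal sketch of what strategy 1 looks like, with illustrative names: every worker adds its local gradient into one shared buffer under a single global mutex. Correct, but the whole reduction serializes, so its cost grows linearly with the thread count.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct {
  pthread_mutex_t lock; /* the single global lock */
  float *sum;           /* shared gradient accumulator */
  size_t n;
} lock_reduce_t;

/* Each worker calls this with its local gradient; all workers
 * serialize on the one mutex, which is the bottleneck. */
static void lock_reduce_add(lock_reduce_t *r, const float *local) {
  pthread_mutex_lock(&r->lock);
  for (size_t i = 0; i < r->n; i++)
    r->sum[i] += local[i];
  pthread_mutex_unlock(&r->lock);
}
```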

strategy 2 (vggnet-tree-allreduce profile; another run, vggnet-tree-allreduce-O3, shows similar stats)

(screenshot)

Bottom-up function time:

(screenshot)
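The tree allreduce of strategy 2 can be sketched as follows (sequential simulation over the per-worker buffers; illustrative names): workers pair up at doubling strides, so the reduction finishes in log2(p) rounds instead of p serialized additions behind one lock.

```c
#include <stddef.h>

/* Pairwise (tree) reduction over p per-worker gradient buffers.
 * Round r combines buffers at stride 2^r; after log2(p) rounds the
 * full sum has accumulated into bufs[0]. */
static void tree_reduce(float **bufs, int p, size_t n) {
  for (int stride = 1; stride < p; stride *= 2)      /* log2(p) rounds */
    for (int i = 0; i + stride < p; i += 2 * stride) /* pairs this round */
      for (size_t j = 0; j < n; j++)
        bufs[i][j] += bufs[i + stride][j];
}
```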

Problem

strategy 3

  1. Since the threads share memory, just let the root
  2. Zeroing the matrix (fill_scalar can be simplified)
  3. Shall have a parallel data loader