fengggli / gpu-computing-materials

A simple deep learning framework that optimizes task scheduling and memory usage on different CPU/GPU architectures.

hybrid parallelism and its strategy #60

Open fengggli opened 4 years ago

fengggli commented 4 years ago

Using a simplified vggnet (e222802a) as the testbed:

  1. resnet has limited communication.
  2. vggnet can be extended easily with different variations, with which we can test whether we can provide a better strategy.

TODO: compare with intel-caffe

fengggli commented 4 years ago

note: results are saved in sievert:~/intel/amplxe/projects; run with /opt/intel/vtune_amplifier/bin64/amplxe-gui. For intel-caffe, make sure to first source the Intel compilers and export the extern/mkl libraries in the intel-caffe source tree.

update

  1. dim_get_capacity and some related macros are now replaced with a more efficient implementation.
  2. Quite different speeds: might they be placed in different NUMA nodes? (No, this runs with 4 threads; the main thread gets roughly 1/4 of the CPU time.)

(screenshot)

run with 12 threads

(screenshot)

note: sgd update can be optimized easily https://software.intel.com/en-us/articles/comparison-between-intel-optimized-caffe-and-vanilla-caffe-by-intel-vtune-amplifier

baseline

  1. Without explicit control of memory allocation: (screenshot)
  2. Launch with different NUMA settings: (screenshot)
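Launching with different NUMA settings can also be done explicitly from inside the process. A minimal sketch, assuming a Linux/glibc environment: pin a thread to one CPU so that, under the first-touch policy, its allocations land on that CPU's NUMA node. `pin_to_cpu` is an illustrative name, not the framework's API; the shell-level equivalent is launching under `numactl --cpunodebind/--membind`.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one CPU. Under the first-touch policy,
 * memory the thread then allocates and initializes stays on that
 * CPU's NUMA node. Illustrative sketch, not the framework's API. */
static int pin_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  /* returns 0 on success, an errno-style code on failure */
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```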
fengggli commented 4 years ago

intel-caffe vgg-simple with 12 threads

(screenshot)

todo: let the data loader load only partial data
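A partial loader could partition the samples by worker rank so each thread reads only its shard instead of the full dataset. A minimal sketch; `shard_range` and its signature are hypothetical, not the framework's loader API:

```c
#include <stddef.h>

/* Worker `rank` loads only its shard [begin, end) of the n samples.
 * The first n % nworkers shards each get one extra sample, so the
 * split is as even as possible. Illustrative names only. */
static void shard_range(size_t n, int nworkers, int rank,
                        size_t *begin, size_t *end) {
  size_t base = n / (size_t)nworkers;
  size_t rem = n % (size_t)nworkers;
  size_t r = (size_t)rank;
  *begin = r * base + (r < rem ? r : rem);
  *end = *begin + base + (r < rem ? 1 : 0);
}
```

For example, 10 samples over 3 workers split into ranges [0,4), [4,7), [7,10).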

It turns out do_sgd_update_momentum and updage_regulizer_gradient are quite expensive in vggnet-hybrid-12: (screenshot)

Those operations can be optimized! (screenshot)
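One optimization, in the spirit of the Intel article linked above, is to fuse the two kernels: fold the L2-regularizer gradient into the momentum update so the weight, velocity, and gradient arrays each stream through memory once instead of twice. A hedged sketch with illustrative names (not the actual `do_sgd_update_momentum` / `updage_regulizer_gradient` signatures):

```c
#include <stddef.h>

/* Fused pass: the regularizer gradient (decay * w) is folded into
 * the momentum update, replacing two separate loops over the
 * parameters with one and roughly halving the memory traffic. */
static void sgd_momentum_l2_fused(float *w, float *v, const float *g,
                                  size_t n, float lr, float mu,
                                  float decay) {
  for (size_t i = 0; i < n; i++) {
    float gi = g[i] + decay * w[i]; /* regularizer folded into gradient */
    v[i] = mu * v[i] - lr * gi;     /* velocity update */
    w[i] += v[i];                   /* weight update, same pass */
  }
}
```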

fengggli commented 4 years ago

Add AVX vectorization.

  1. `./bench/bench-net-hybrid vggnet 128 12 10`

| setup | forward-backward | allreduce | gradient update |
|---|---|---|---|
| initial | 261.7 | 256.3 | 147.8 |

Vectorization doesn't help much here (-O3 and -march=native have already vectorized the simple loops); the time is spent in the all_reduce operations.
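For reference, "add AVX vectorization" for loops of this shape amounts to something like the sketch below (illustrative name `saxpy_avx`, assuming an AVX-capable x86 CPU). As noted, -O3 with -march=native already emits equivalent code for such simple loops, which is why the manual version gains little:

```c
#include <immintrin.h>
#include <stddef.h>

/* Hand-vectorized update w += alpha * g, eight floats per iteration.
 * Assumes the CPU supports AVX; the target attribute lets the
 * compiler emit AVX code for this function without global flags. */
__attribute__((target("avx")))
static void saxpy_avx(float *w, const float *g, size_t n, float alpha) {
  __m256 va = _mm256_set1_ps(alpha);
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 vw = _mm256_loadu_ps(w + i);
    __m256 vg = _mm256_loadu_ps(g + i);
    _mm256_storeu_ps(w + i, _mm256_add_ps(vw, _mm256_mul_ps(va, vg)));
  }
  for (; i < n; i++) /* scalar tail for n not divisible by 8 */
    w[i] += alpha * g[i];
}
```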

fengggli commented 4 years ago

strategy 1

Problem: a single lock makes the whole reduction a serial critical section.
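A minimal sketch of what strategy 1 looks like, with illustrative names: every worker adds its local gradient into one shared buffer under a single global mutex. Correct, but the whole reduction serializes, so its cost grows linearly with the thread count.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct {
  pthread_mutex_t lock; /* the single global lock */
  float *sum;           /* shared gradient accumulator */
  size_t n;
} lock_reduce_t;

/* Each worker calls this with its local gradient; all workers
 * serialize on the one mutex, which is the bottleneck. */
static void lock_reduce_add(lock_reduce_t *r, const float *local) {
  pthread_mutex_lock(&r->lock);
  for (size_t i = 0; i < r->n; i++)
    r->sum[i] += local[i];
  pthread_mutex_unlock(&r->lock);
}
```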

strategy 2 (vggnet-tree-allreduce profile; another run, vggnet-tree-allreduce-O3, shows similar stats)

(screenshot)

Bottom-up function time:

(screenshot)
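The tree allreduce of strategy 2 can be sketched as follows (sequential simulation over the per-worker buffers; illustrative names): workers pair up at doubling strides, so the reduction finishes in log2(p) rounds instead of p serialized additions behind one lock.

```c
#include <stddef.h>

/* Pairwise (tree) reduction over p per-worker gradient buffers.
 * Round r combines buffers at stride 2^r; after log2(p) rounds the
 * full sum has accumulated into bufs[0]. */
static void tree_reduce(float **bufs, int p, size_t n) {
  for (int stride = 1; stride < p; stride *= 2)      /* log2(p) rounds */
    for (int i = 0; i + stride < p; i += 2 * stride) /* pairs this round */
      for (size_t j = 0; j < n; j++)
        bufs[i][j] += bufs[i + stride][j];
}
```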

Problem

strategy 3

  1. Since the threads share memory, just let the root
  2. Zeroing the matrix (fill_scalar can be simplified)
  3. Shall have a parallel data loader