Open fengggli opened 4 years ago
note: results are saved in sievert:~/intel/amplxe/projects; open them with /opt/intel/vtune_amplifier/bin64/amplxe-gui. For intel-caffe, first source the Intel compilers and export the extern/mkl libraries in the intel-caffe source tree.
note: the sgd update can be optimized easily, see https://software.intel.com/en-us/articles/comparison-between-intel-optimized-caffe-and-vanilla-caffe-by-intel-vtune-amplifier
todo: let data loader only load partial data
It turns out do_sgd_update_momentum and updage_regulizer_gradient are quite expensive in vggnet-hybrid-12:
Those operations can be optimized!
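One possible direction, sketched under assumptions: if the regularizer gradient and the momentum update currently run as separate passes over the parameters, they can be fused into a single loop so each weight is read and written once. The function name, signature, and the exact update rule below (L2 weight decay plus classical momentum) are illustrative assumptions, not the project's actual code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical fused update (name and signature are illustrative only).
 * A single pass over the parameters computes, per element:
 *   grad = g + weight_decay * w      (L2 regularizer gradient)
 *   v    = momentum * v - lr * grad  (momentum update)
 *   w    = w + v                     (apply the step)
 */
void sgd_momentum_l2_fused(float *restrict w, const float *restrict g,
                           float *restrict v, size_t n,
                           float lr, float momentum, float weight_decay)
{
    assert(w != NULL && g != NULL && v != NULL);
    for (size_t i = 0; i < n; ++i) {
        float grad = g[i] + weight_decay * w[i];
        v[i] = momentum * v[i] - lr * grad;
        w[i] += v[i];
    }
}
```

The `restrict` qualifiers tell the compiler the three arrays do not alias, which is what lets it vectorize the loop at -O3. Whether fusing helps here depends on how the two functions are currently written.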
setup | forward-backward | allreduce | gradientupdate |
---|---|---|---|
initial | 261.7 | 256.3 | 147.8 |
Vectorization doesn't help much here (-O3 and -march=native have already vectorized the simple loops); the time is spent in all_reduce operations.
Problem: a single lock guards the critical section.
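A rough sketch of one way around a single lock, assuming the lock serializes accumulation into a shared gradient buffer (these notes do not confirm that): give each worker a disjoint slice of the output, so no two workers ever write the same element. `reduce_slice`, its parameters, and the per-worker buffer layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: worker `rank` of `nworkers` sums only its own
 * contiguous slice of every per-worker gradient buffer into `out`.
 * The slices are disjoint, so concurrent workers never write the same
 * element and no shared lock is needed. A barrier is still required
 * before any worker reads the full `out` array.
 */
void reduce_slice(const float *const *bufs, size_t nworkers, size_t n,
                  size_t rank, float *out)
{
    assert(bufs != NULL && out != NULL && nworkers > 0 && rank < nworkers);
    size_t begin = n * rank / nworkers;
    size_t end = n * (rank + 1) / nworkers;
    for (size_t i = begin; i < end; ++i) {
        float sum = 0.0f;
        for (size_t k = 0; k < nworkers; ++k)
            sum += bufs[k][i];
        out[i] = sum;
    }
}
```

Each thread would call `reduce_slice` with its own `rank`; calling it in a plain loop over all ranks gives the same result serially, which is handy for checking correctness before adding threads.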
bottom-up function time:
Problem
Using a simplified vggnet (e222802a) as the testbed.
TODO: compare with intel-caffe