Description:
We are currently enabling multi-node training for MXNet Sockeye and found that when the normalization type is valid, the loss normalizer for softmax is incorrect in distributed training (see softmax_output-inl.h).
The correct implementation should be:
If gradients are all-reduced in sum mode, valid_cnt should be all-reduced as well: grads = grads / valid_cnt.
If gradients are all-reduced in average mode, valid_cnt should be all-reduced too: grads = grads * node_num / valid_cnt.
The main reason is that in topologies such as SSD (CNN) or NMT (RNN), valid_cnt differs from node to node, so normalizing by each node's local count does not give the gradient normalized by the global count.
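The two formulas above can be checked with a small simulation. This is only a sketch under stated assumptions, not MXNet code: each node is modeled as a single scalar gradient that has not yet been divided by any count, valid_cnt is assumed to be combined across nodes with a sum, and the helpers `allreduce_sum`, `allreduce_avg`, and the three `normalize_*` functions are hypothetical names introduced for illustration.

```python
def allreduce_sum(values):
    """Stand-in for a sum all-reduce: every node would receive the total."""
    return sum(values)


def allreduce_avg(values):
    """Stand-in for an average all-reduce: total divided by the node count."""
    return sum(values) / len(values)


def normalize_sum_mode(local_grads, local_valid_cnts):
    # Gradients reduced in sum mode; valid_cnt is also summed (assumption).
    grads = allreduce_sum(local_grads)
    valid_cnt = allreduce_sum(local_valid_cnts)
    return grads / valid_cnt


def normalize_avg_mode(local_grads, local_valid_cnts):
    # Gradients reduced in average mode; multiplying by node_num undoes the
    # averaging before dividing by the summed valid_cnt (assumption).
    node_num = len(local_grads)
    grads = allreduce_avg(local_grads)
    valid_cnt = allreduce_sum(local_valid_cnts)
    return grads * node_num / valid_cnt


def normalize_locally_then_average(local_grads, local_valid_cnts):
    # The behavior the issue describes as incorrect: each node divides by
    # its own valid_cnt, and the results are then averaged.
    per_node = [g / c for g, c in zip(local_grads, local_valid_cnts)]
    return allreduce_avg(per_node)


# Two hypothetical nodes with different numbers of valid (non-ignored) labels.
local_grads = [6.0, 4.0]      # unnormalized gradient sums per node
local_valid_cnts = [3, 1]     # valid label counts per node

print(normalize_sum_mode(local_grads, local_valid_cnts))
print(normalize_avg_mode(local_grads, local_valid_cnts))
print(normalize_locally_then_average(local_grads, local_valid_cnts))
```

With these inputs, the sum-mode and average-mode formulas agree with each other, while normalizing locally and then averaging gives a different value because the node with fewer valid labels is weighted too heavily.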