Reduce CUDA kernel launch in BN

chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer

https://chainer.org

MIT License


Closed: okuta closed this 6 years ago

okuta commented 6 years ago

Currently, MultiNodeBatchNormalizationFunction performs 4 multiply and 2 add operations, each as a separate CUDA kernel launch. This PR unifies them into a single kernel launch.
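For readers unfamiliar with the idea, here is a minimal sketch of why fusing elementwise operations reduces launches. It is a CPU-only simulation, not the code from this PR: `launch`, `bn_unfused`, `bn_fused`, and `stats` are hypothetical names, and the standard batch-normalization expression `gamma * (x - mean) * inv_std + beta` is assumed as the example rather than taken from the actual operations in MultiNodeBatchNormalizationFunction.

```python
# Illustrative simulation only; none of these names come from ChainerMN.
# Each call to `launch` stands in for one CUDA kernel launch.

stats = {"launches": 0}


def launch(op, *arrays):
    """Apply `op` elementwise over the inputs and count one launch."""
    stats["launches"] += 1
    return [op(*vals) for vals in zip(*arrays)]


def bn_unfused(x, mean, inv_std, gamma, beta):
    # Every arithmetic step is its own launch and writes a temporary array.
    centered = launch(lambda a, m: a - m, x, mean)
    x_hat = launch(lambda c, s: c * s, centered, inv_std)
    scaled = launch(lambda h, g: h * g, x_hat, gamma)
    return launch(lambda v, b: v + b, scaled, beta)


def bn_fused(x, mean, inv_std, gamma, beta):
    # The whole expression is evaluated per element inside a single launch.
    return launch(
        lambda a, m, s, g, b: ((a - m) * s) * g + b,
        x, mean, inv_std, gamma, beta,
    )


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    mean = [2.0] * 3
    inv_std = [0.5] * 3
    gamma = [2.0] * 3
    beta = [1.0] * 3

    before = stats["launches"]
    y_unfused = bn_unfused(x, mean, inv_std, gamma, beta)
    print("unfused launches:", stats["launches"] - before)

    before = stats["launches"]
    y_fused = bn_fused(x, mean, inv_std, gamma, beta)
    print("fused launches:", stats["launches"] - before)

    print("same result:", y_unfused == y_fused)
```

On a GPU, the same effect would typically be achieved with a user-defined elementwise kernel (for example via CuPy's `cupy.ElementwiseKernel`), which avoids both the per-launch overhead and the intermediate arrays.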

keisukefukuda commented 6 years ago

Replaced by #282