Multigpu startup performance improvement

LeelaChessZero / lczero-training

For code etc relating to the network training process.


Multigpu startup performance improvement #147

Closed: Tilps closed this 3 years ago

Tilps commented 3 years ago

The grads were unused, so forcing AutoGraph to set up everything needed to compute each individual mean across the replicas was pointless and very expensive. This change saves 80 seconds on the first train step on RAF's training machine.
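As a rough illustration of why this matters, here is a framework-free toy model. It is not the lczero-training code and does not use TensorFlow. It only counts how many cross-replica mean reductions a step has to set up when per-variable gradients are returned versus when only the loss is returned. The names `replica_step` and `train_step`, the replica count, and the variable count are all hypothetical.

```python
from statistics import fmean


def replica_step(replica_id, num_vars):
    """Hypothetical per-replica step: returns a loss and one gradient per variable."""
    loss = 1.0 + replica_id
    grads = [0.1 * (replica_id + 1) * (i + 1) for i in range(num_vars)]
    return loss, grads


def train_step(num_replicas, num_vars, return_grads):
    """Run the step on every replica and count cross-replica mean reductions."""
    outputs = [replica_step(r, num_vars) for r in range(num_replicas)]

    # The loss always needs a single reduction across replicas.
    mean_loss = fmean(loss for loss, _ in outputs)
    reductions = 1

    mean_grads = None
    if return_grads:
        # One extra reduction per variable, even if nobody reads the result.
        mean_grads = []
        for i in range(num_vars):
            mean_grads.append(fmean(grads[i] for _, grads in outputs))
            reductions += 1

    return mean_loss, mean_grads, reductions


if __name__ == "__main__":
    for return_grads in (True, False):
        _, _, n = train_step(num_replicas=4, num_vars=200, return_grads=return_grads)
        print(f"return_grads={return_grads}: {n} reductions")
```

In this toy the number of reductions grows with the number of variables when the gradients are returned, and stays at one when they are not. The real saving reported above comes from graph setup time, which this sketch does not measure.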

Tilps commented 3 years ago

I believe we used to include the grads in TensorBoard, but we dropped them as part of rationalizing our TensorBoard data.