This PR includes the following improvements and bug fixes:

- Reduce compression overhead for fp16 gradients by converting them to fp32 before compression and back to fp16 before returning them to the application (sketched below).
- Improve the random-k algorithm by using the same random seed on all workers and servers, so that every node samples identical indices (see the second sketch below). Our results show that the improved algorithm trains ResNet-50 on ImageNet with no accuracy loss.
- Achieve workload balance on the servers by using the original (uncompressed) tensor size as the workload estimate (see the third sketch below).
- Add vanilla error-feedback, which does not use the learning rate for correction (see the last sketch below).
- Add sparse error-feedback for the random-k algorithm.
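Below is a minimal sketch of the fp16 conversion path. `compress` and `decompress` are stand-in placeholders for any compressor, not the actual code paths in this PR:

```python
import numpy as np

def compress_with_upcast(grad_fp16, compress, decompress):
    """Hypothetical wrapper (illustrative names only).
    Upcast fp16 gradients to fp32 so the compressor runs at full
    precision, then downcast before handing the result back."""
    grad_fp32 = grad_fp16.astype(np.float32)  # fp16 -> fp32 before compression
    payload = compress(grad_fp32)             # compressor only ever sees fp32
    restored = decompress(payload)            # fp32 result after aggregation
    return restored.astype(np.float16)        # fp32 -> fp16 for the application
```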
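A sketch of the shared-seed random-k idea, assuming the seed is advanced identically on every node each step (e.g., derived from the iteration count, which is an assumption here). With the same seed, workers and servers sample the same indices, so the index list never has to be transmitted:

```python
import numpy as np

def randomk_indices(num_elements, k, seed):
    # Same seed on every worker and server -> identical index sample,
    # so indices need not be sent and partial aggregation on the
    # servers stays consistent across workers.
    rng = np.random.default_rng(seed)
    return rng.choice(num_elements, size=k, replace=False)

def compress(grad, k, seed):
    idx = randomk_indices(grad.size, k, seed)
    return grad.flat[idx]                      # only k values go on the wire

def decompress(values, shape, k, seed):
    out = np.zeros(shape, dtype=values.dtype)
    out.flat[randomk_indices(out.size, k, seed)] = values
    return out
```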
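For the workload-balance item, one way to picture the policy is a greedy assignment that counts the original (uncompressed) bytes rather than the compressed payload. This is only an illustration of the estimation idea, not the actual BytePS partitioner:

```python
def assign_to_servers(tensors, num_servers):
    """Greedy sketch: balance by the ORIGINAL size, since compressed
    sizes vary step to step and under-estimate server work.
    `tensors` maps tensor name -> original size in bytes."""
    load = [0] * num_servers
    assignment = {}
    for name, original_size in sorted(tensors.items(),
                                      key=lambda kv: -kv[1]):
        server = load.index(min(load))   # least-loaded server so far
        assignment[name] = server
        load[server] += original_size    # count the uncompressed bytes
    return assignment
```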
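And a sketch of the error-feedback variants. `compress`/`decompress` are again placeholders; per this PR, the vanilla variant applies no learning-rate scaling to the correction:

```python
import numpy as np

class ErrorFeedback:
    """Vanilla error-feedback: accumulate what the compressor dropped
    and add it back before compressing the next gradient. No
    learning-rate scaling is applied to the correction."""
    def __init__(self):
        self.residual = None

    def step(self, grad, compress, decompress):
        if self.residual is None:
            self.residual = np.zeros_like(grad)
        corrected = grad + self.residual                 # re-inject past error
        payload = compress(corrected)                    # e.g. random-k
        self.residual = corrected - decompress(payload)  # what was dropped
        return payload
```

For a sparse compressor such as random-k, `corrected - decompress(payload)` is simply `corrected` with the k transmitted coordinates zeroed out, so the residual holds exactly the unsent coordinates; that is the sparse error-feedback case added here.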
This PR does not cover passing the learning rate to remote servers, nor the hang issue in MXNet. The PR is ready for merge.
cc: @eric-haibin-lin @szhengac