bytedance / byteps

A high performance and generic framework for distributed DNN training

gradient compression updates #395

Open jasperzhong opened 3 years ago

jasperzhong commented 3 years ago

This PR includes:

  1. support for PyTorch and Apex.
  2. reduced compression overhead for fp16 gradients: they are converted to fp32 before compression and back to fp16 when returned to the application (see the first sketch below).
  3. an improved random-k algorithm that uses the same seed on all workers and servers (see the second sketch below). Our results show that the improved algorithm can successfully train ResNet-50 on ImageNet without accuracy loss.
  4. workload balance on the servers, achieved by using the original (uncompressed) tensor size as the workload estimate.
  5. vanilla error-feedback, which does not use the learning rate for correction (see the third sketch below).
  6. sparse error-feedback for the random-k algorithm.
  7. updated docs.
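A minimal sketch of the fp16 handling in item 2, assuming a `compressor` object with `compress()`/`decompress()` methods; these names are placeholders, and the actual BytePS implementation lives in C++ and differs in detail:

```python
import torch


def compress_fp16(grad: torch.Tensor, compressor):
    """Compress an fp16 gradient in fp32 precision (hypothetical wrapper)."""
    was_fp16 = grad.dtype == torch.float16
    if was_fp16:
        grad = grad.float()                      # fp16 -> fp32 before compression
    compressed, ctx = compressor.compress(grad)
    return compressed, (ctx, was_fp16)


def decompress_fp16(compressed, state, compressor):
    ctx, was_fp16 = state
    grad = compressor.decompress(compressed, ctx)
    return grad.half() if was_fp16 else grad     # back to fp16 for the application
```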
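A sketch of the seeded random-k idea in item 3, under the assumption that sharing one seed lets every worker and server regenerate the same index set locally, so their sparse updates line up; function names and the exact index-generation scheme are illustrative, not the PR's actual code:

```python
import torch


def randomk_compress(grad: torch.Tensor, k: int, seed: int) -> torch.Tensor:
    """Pick k random elements; a shared seed makes all workers/servers pick the same set."""
    gen = torch.Generator().manual_seed(seed)
    idx = torch.randperm(grad.numel(), generator=gen)[:k]
    return grad.flatten()[idx.to(grad.device)]


def randomk_decompress(values: torch.Tensor, numel: int, k: int, seed: int) -> torch.Tensor:
    """Rebuild a dense tensor by regenerating the same indices from the seed."""
    gen = torch.Generator().manual_seed(seed)
    idx = torch.randperm(numel, generator=gen)[:k]
    out = torch.zeros(numel, dtype=values.dtype, device=values.device)
    out[idx.to(values.device)] = values
    return out
```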
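For item 5, a rough sketch of what "vanilla" error-feedback (no learning-rate scaling) typically looks like: the residual lost to compression is carried over and added to the next gradient. The class and method names are hypothetical and only illustrate the general technique:

```python
import torch


class VanillaErrorFeedback:
    """Carry the compression residual into the next step, without learning-rate scaling."""

    def __init__(self):
        self.residual = None
        self._corrected = None

    def correct(self, grad: torch.Tensor) -> torch.Tensor:
        # Add back what earlier compressions threw away.
        if self.residual is not None:
            grad = grad + self.residual
        self._corrected = grad
        return grad

    def update(self, decompressed: torch.Tensor) -> None:
        # Residual = corrected gradient minus what actually got through the compressor.
        self.residual = self._corrected - decompressed
```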

Bug fixes:

  1. fix the MXNet extension incorrectly linking against PyTorch's libraries (setup.py).

This PR does not cover passing the learning rate to remote servers, nor does it address the hang issue in MXNet. The PR is ready for merge.

cc: @eric-haibin-lin @szhengac