PaddlePaddle / Paddle

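PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)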
http://www.paddlepaddle.org/
Apache License 2.0
22.23k stars 5.58k forks

Fluid distributed training TODO #10279

Closed Yancey1989 closed 6 years ago

Yancey1989 commented 6 years ago

Fluid Distribute Training Features

EDL

Support different communication library

Experiment

CE

Future

panyx0718 commented 6 years ago

Some extras that might be worth adding:

  • distributed data reader (should be unified with the single-machine reader)
  • evaluate how different distributed training strategies (sync, async, etc.) influence model accuracy
  • sort out the differences between multi-machine-single-device and multi-machine-multi-device
  • better integration with single-machine training
  • think about more flexible, user-customized device placement for multi-machine training
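To make the sync vs. async distinction concrete, here is a toy sketch of how a parameter server might apply gradients under each strategy. It is not Paddle code: `grad_fn`, `sync_step`, and `async_step` are hypothetical names, the model is a single scalar weight, and the async case assumes every trainer read the same stale weight before pushing its gradient.

```python
from typing import Callable, List


def grad_fn(w: float, target: float) -> float:
    """Gradient of the toy loss (w - target)^2 with respect to w."""
    return 2.0 * (w - target)


def sync_step(w: float, shards: List[float], lr: float,
              grad: Callable[[float, float], float] = grad_fn) -> float:
    """Sync: wait for every trainer, average the gradients, update once."""
    grads = [grad(w, s) for s in shards]
    return w - lr * sum(grads) / len(grads)


def async_step(w: float, shards: List[float], lr: float,
               grad: Callable[[float, float], float] = grad_fn) -> float:
    """Async (simulated): each gradient is applied as soon as it arrives.

    All trainers are assumed to have read the same stale weight, so later
    updates are computed from a weight the server has already moved past.
    """
    stale_w = w
    for s in shards:
        w = w - lr * grad(stale_w, s)
    return w


if __name__ == "__main__":
    shards = [1.0, 3.0]  # one toy data shard (target value) per trainer
    print("sync :", sync_step(0.0, shards, lr=0.1))
    print("async:", async_step(0.0, shards, lr=0.1))
```

With the same learning rate, the async variant takes a larger effective step because it applies each trainer's gradient without averaging, which is one reason the two strategies can reach different accuracy.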

panyx0718 commented 6 years ago

Fault tolerance is a basic distributed training feature that probably shouldn't belong to EDL only.

seiriosPlus commented 6 years ago

Checkpointing needs to be added to the training features.
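As a rough illustration of what a checkpoint feature has to guarantee, the sketch below writes state to a temporary file and then renames it atomically, so a crash mid-write never leaves a half-written checkpoint. The function names and the JSON format are assumptions for illustration only, not Fluid's actual checkpoint API or on-disk format.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional


def save_checkpoint(state: Dict[str, Any], path: str) -> None:
    """Write `state` to `path` atomically (temp file + rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit the disk before renaming
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> Optional[Dict[str, Any]]:
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    save_checkpoint({"step": 3, "w": [0.1, 0.2]}, ckpt)
    print(load_checkpoint(ckpt))
```

A trainer or pserver would call `load_checkpoint` at startup and resume from the stored step if a checkpoint is found.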

typhoonzero commented 6 years ago

Maybe we should divide fault tolerance into several parts:

typhoonzero commented 6 years ago

The overall future roadmap should include the following parts:

  1. Complete the features of Fluid distributed training
    • [ ] code clean up and polish
    • [ ] implement LARS -- @typhoonzero doing
    • [ ] pserver checkpointing
    • [ ] init trainer weights from pserver
    • [ ] distributed lookup table
    • [ ] full overlapping with parallel executor and dist training
    • [ ] complete async training, pserver use parallel executor
    • [ ] remote executor runs ProgramDesc (depends on "Complete Fluid")
  2. Ability to switch between communication libraries for different use cases.
    • [ ] grpc performance enhancement
    • [ ] OpenMPI with RDMA and GPU Direct
    • [ ] NCCL2 multi-node implementation
    • [ ] follow up brpc
  3. EDL
    • [ ] master implementation
    • [ ] etcd operators
  4. CE
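
For the LARS item in the roadmap above, the following is a minimal sketch of one LARS (Layer-wise Adaptive Rate Scaling) update for a single layer, following the commonly cited formulation in which the layer's learning rate is scaled by the ratio of the weight norm to the gradient norm. It uses plain Python lists; `lars_step` and its parameter names are hypothetical and do not reflect the Fluid implementation.

```python
import math
from typing import List, Tuple


def lars_step(weights: List[float], grads: List[float], velocity: List[float],
              lr: float, momentum: float = 0.9, weight_decay: float = 1e-4,
              trust_coef: float = 0.001,
              eps: float = 1e-9) -> Tuple[List[float], List[float]]:
    """Apply one LARS update to a single layer; returns (weights, velocity)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm > 0.0 and g_norm > 0.0:
        # Layer-wise "trust ratio": large weights with small gradients
        # get a larger local learning rate, and vice versa.
        local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    else:
        local_lr = 1.0  # fall back to the global rate for degenerate layers
    new_velocity = [
        momentum * v + lr * local_lr * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w - v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


if __name__ == "__main__":
    w, v = lars_step([3.0, 4.0], [0.6, 0.8], [0.0, 0.0], lr=1.0)
    print(w, v)
```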

Yancey1989 commented 6 years ago

Thanks, @panyx0718 @seiriosPlus @typhoonzero, I updated this issue following your comments.

gongweibao commented 6 years ago

Do we need to design an abstract interface for the communication backend so that it is compatible with various implementations:
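
One possible shape for such an interface, sketched in Python purely for illustration: an abstract base class that each backend (gRPC, brpc, MPI, and so on) would implement, plus a small factory for switching between them. `CommBackend`, `InMemoryBackend`, and `create_backend` are hypothetical names, and the in-memory backend only stands in for a real transport.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Type


class CommBackend(ABC):
    """Hypothetical minimal contract every communication backend would meet."""

    @abstractmethod
    def send(self, endpoint: str, name: str, value: List[float]) -> None:
        """Send the variable `name` to `endpoint`."""

    @abstractmethod
    def recv(self, endpoint: str, name: str) -> List[float]:
        """Fetch the variable `name` from `endpoint`."""


class InMemoryBackend(CommBackend):
    """Stand-in transport that keeps values in a dict (for tests only)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], List[float]] = {}

    def send(self, endpoint: str, name: str, value: List[float]) -> None:
        self._store[(endpoint, name)] = list(value)

    def recv(self, endpoint: str, name: str) -> List[float]:
        return list(self._store[(endpoint, name)])


# Registry of available backends; real ones would be added here.
_BACKENDS: Dict[str, Type[CommBackend]] = {"memory": InMemoryBackend}


def create_backend(kind: str) -> CommBackend:
    """Instantiate a backend by name so callers never depend on a concrete class."""
    if kind not in _BACKENDS:
        raise ValueError(f"unknown communication backend: {kind!r}")
    return _BACKENDS[kind]()


if __name__ == "__main__":
    backend = create_backend("memory")
    backend.send("pserver0", "fc_0.w", [0.5, -0.5])
    print(backend.recv("pserver0", "fc_0.w"))
```

Training code that only talks to `CommBackend` could then switch transports through configuration rather than code changes.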

gongweibao commented 6 years ago

I think there are probably many things to do, and we'd better tackle them in order, with classification and priorities.

typhoonzero commented 6 years ago

Closing this issue; most of the work is done except for the brpc- and EDL-related items.