Fluid distributed training TODO

Yancey1989 commented 6 years ago

Fluid Distributed Training Features

code cleanup and polish
implement LARS to improve training performance, #6811
fault-tolerant
- checkpointing and recovering parameters on pserver
- recover reader offset (may need master and etcd)
- trainer pre-fetch parameters from pserver after the restart
async training, #9941
distributed data reader (should unify with single machine reader)
calculate global AUC with the distributed table
initialize trainable parameters from saved parameters on a trainer
ring-based architecture to improve training performance
distributed lookup table, https://github.com/PaddlePaddle/Paddle/projects/56
full overlapping with parallel-executor on dist training
split send_op into multiple send_vars_op and fetch_vars_op, #9161
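
For reference on the LARS item above: LARS (layer-wise adaptive rate scaling) gives each layer its own local learning rate derived from the ratio of the weight norm to the gradient norm. Below is a minimal plain-Python sketch of that rule, without momentum. It is not the Fluid implementation; the function names and default coefficients are assumptions chosen for illustration.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coeff=0.001, weight_decay=0.0005, eps=1e-12):
    """Per-layer local learning rate: trust_coeff * ||w|| / (||g|| + wd * ||w||).

    Hypothetical helper; the defaults are illustrative, not Fluid's.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        # Fall back to no scaling when a norm is zero (e.g. freshly zeroed layer).
        return 1.0
    return trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)


def lars_step(weights, grads, global_lr, trust_coeff=0.001, weight_decay=0.0005):
    """One SGD step for a single layer, scaled by the LARS local learning rate."""
    local_lr = lars_local_lr(weights, grads, trust_coeff, weight_decay)
    return [
        w - global_lr * local_lr * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]
```

In a distributed setting the norms would be computed per parameter block, so each pserver could apply the rule to the blocks it owns without extra communication; that placement is a design assumption, not something decided in this issue.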

EDL

implement the master process to schedule task
etcd operator
implement CRD to support kubernetes v1.8

Support different communication library

gRPC performance enhancement
OpenMPI with RDMA and GPU direct
NCCL2 with multiple nodes
follow up bRPC

Experiment

measure how different distributed training strategies (sync, async, etc.) affect model accuracy and throughput

CE

Auto execute benchmark-job on AWS and generate a report

Future

differences between multi-machine-single-device and multi-machine-multi-device
better integration with single-machine training
think about more flexible user-customized device placement for multi-machine training.
need to discuss whether we need the remote executor

panyx0718 commented 6 years ago

Some extras that might be worth adding:

distributed data reader (should unify with single machine reader)
evaluate different distributed training strategy (sync, async, etc.) influence on model accuracy
sort out differences between multi-machine-single-device and multi-machine-multi-device
better integration with single-machine training
think about more flexible user-customized device placement for multi-machine training

panyx0718 commented 6 years ago

Fault tolerance is a basic distributed training feature, so it probably doesn't belong under EDL only.

seiriosPlus commented 6 years ago

Checkpointing needs to be added to the training features.

typhoonzero commented 6 years ago

Maybe we should divide fault tolerance into several parts:

Base features
- checkpointing and recovering on pserver
- trainer pull checkpoint from pserver
- recover reader offset (requires master and etcd)
Clustering feature
- autostart failed trainer based on cluster system (Kubernetes etc.)
- autoscale trainer based on cluster system.
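
The base features above (checkpointing, recovering, and restoring the reader offset) can be sketched in a few lines. This is only an illustration under stated assumptions: a local JSON file stands in for pserver storage and for the master/etcd record of the reader offset, and `save_checkpoint`, `load_checkpoint`, and `resume_reader` are hypothetical names, not Fluid APIs.

```python
import json
import os
import tempfile


def save_checkpoint(path, params, reader_offset):
    """Write parameters and the reader offset together, atomically.

    Writing to a temp file and renaming means a crash mid-write never
    leaves a half-written checkpoint at `path`.
    """
    state = {"params": params, "reader_offset": reader_offset}
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (params, reader_offset), or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        state = json.load(f)
    return state["params"], state["reader_offset"]


def resume_reader(samples, reader_offset):
    """Skip samples that were already consumed before the failure."""
    return samples[reader_offset:]
```

A restarted trainer would call `load_checkpoint` first, start from scratch if it returns `None`, and otherwise continue from the saved offset. Saving the parameters and the offset in one record keeps the two consistent with each other.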

typhoonzero commented 6 years ago

The overall future roadmap should include the following parts:

Complete features of fluid distributed
- [ ] code clean up and polish
- [ ] implement LARS -- @typhoonzero doing
- [ ] pserver checkpointing
- [ ] init trainer weights from pserver
- [ ] distributed lookup table
- [ ] full overlapping with parallel executor and dist training
- [ ] complete async training, pserver use parallel executor
- [ ] remote executor runs ProgramDesc (depend on "Complete Fluid")
Ability to switch between communication libraries for different use cases.
- [ ] grpc performance enhancement
- [ ] OpenMPI with RDMA and GPU Direct
- [ ] NCCL2 multi-node implementation
- [ ] follow up brpc
EDL
- [ ] master implementation
- [ ] etcd operators
CE

Yancey1989 commented 6 years ago

Thanks, @panyx0718 @seiriosPlus @typhoonzero, I updated this issue following your comments.

gongweibao commented 6 years ago

Do we need to design an abstract interface for the communication backend, so that it is compatible with various implementations?

Sync: nccl, mpi...
Async: RPC
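
One possible shape for such an interface, as a rough sketch only: trainers call the same two methods whether the backend is synchronous (an allreduce-style average, as with NCCL or MPI) or asynchronous (RPC to a pserver that applies each gradient on arrival). Both backends here are in-process stand-ins that do not call any real communication library, and every class and method name is hypothetical.

```python
from abc import ABC, abstractmethod


class CommBackend(ABC):
    """Hypothetical interface that trainer code would program against."""

    @abstractmethod
    def push_grad(self, trainer_id, grad):
        """Hand one trainer's gradient to the backend."""

    @abstractmethod
    def pull_param(self):
        """Return a copy of the current parameter values."""


class SyncBackend(CommBackend):
    """Stand-in for allreduce: update only once every trainer has reported."""

    def __init__(self, init_param, num_trainers, lr):
        self._param = list(init_param)
        self._num_trainers = num_trainers
        self._lr = lr
        self._pending = {}

    def push_grad(self, trainer_id, grad):
        self._pending[trainer_id] = grad
        if len(self._pending) == self._num_trainers:
            # Average the gradients element-wise, then take one SGD step.
            avg = [sum(col) / self._num_trainers for col in zip(*self._pending.values())]
            self._param = [p - self._lr * g for p, g in zip(self._param, avg)]
            self._pending.clear()

    def pull_param(self):
        return list(self._param)


class AsyncBackend(CommBackend):
    """Stand-in for RPC to a pserver: apply each gradient as soon as it arrives."""

    def __init__(self, init_param, lr):
        self._param = list(init_param)
        self._lr = lr

    def push_grad(self, trainer_id, grad):
        self._param = [p - self._lr * g for p, g in zip(self._param, grad)]

    def pull_param(self):
        return list(self._param)
```

With this split, switching between sync and async training would be a matter of constructing a different backend, while the trainer loop stays unchanged.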

gongweibao commented 6 years ago

I think there are a lot of things to do here, so we'd better classify them, prioritize them, and do them in order.

typhoonzero commented 6 years ago

Closing this issue; most of the work is done, except for the brpc- and EDL-related items.
