Yancey1989 closed this issue 6 years ago
Some extras that might be worth adding:
- a distributed data reader (should unify with the single-machine reader)
- evaluate how different distributed training strategies (sync, async, etc.) influence model accuracy
- sort out the differences between multi-machine-single-device and multi-machine-multi-device
- better integration with single-machine training
- think about more flexible, user-customized device placement for multi-machine training
fault-tolerance is a basic distributed training feature that probably doesn't belong to EDL only.
checkpoint needs to be added to the training features.
Maybe we should divide fault-tolerance into several parts:
The overall future roadmap should include the following parts:
ProgramDesc (depends on "Complete Fluid")

Thanks, @panyx0718 @seiriosPlus @typhoonzero, I updated this issue following your comments.
Do we need to design an abstract interface for the communication backend, so that it is compatible with various implementations:
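To make the question concrete, here is a minimal sketch of what such an abstraction could look like. Everything in it is an assumption for illustration: `CommBackend`, `InMemoryBackend`, `push_gradient`, the endpoint string, and the variable name are hypothetical and are not part of Fluid's actual API. The in-memory class only stands in for a real transport such as brpc.

```python
from abc import ABC, abstractmethod
from collections import defaultdict, deque


class CommBackend(ABC):
    """Hypothetical abstract interface; not an existing Fluid class."""

    @abstractmethod
    def send(self, endpoint: str, name: str, payload: bytes) -> None:
        """Deliver a named variable (serialized as bytes) to an endpoint."""

    @abstractmethod
    def recv(self, endpoint: str, name: str) -> bytes:
        """Fetch the oldest pending payload for a named variable."""


class InMemoryBackend(CommBackend):
    """Toy backend for illustration; a real one would wrap an RPC library."""

    def __init__(self) -> None:
        # One FIFO queue per (endpoint, variable name) pair.
        self._queues = defaultdict(deque)

    def send(self, endpoint: str, name: str, payload: bytes) -> None:
        self._queues[(endpoint, name)].append(payload)

    def recv(self, endpoint: str, name: str) -> bytes:
        queue = self._queues[(endpoint, name)]
        if not queue:
            raise LookupError(f"nothing pending for {name!r} at {endpoint}")
        return queue.popleft()


def push_gradient(backend: CommBackend, endpoint: str, name: str, grad: bytes) -> None:
    """Trainer-side code depends only on the interface, not on the transport."""
    backend.send(endpoint, name, grad)
```

With this shape, swapping the transport would mean adding another `CommBackend` subclass, while callers such as `push_gradient` stay unchanged. Whether that extra layer is worth it depends on how much the real libraries differ in practice.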
I think there are many things to do, and we'd better organize them by order, category, and priority.
Closing this issue; most of the work is done, except for the brpc and EDL related items.
Fluid Distribute Training Features
EDL
Support different communication libraries
Experiment
CE
Future