Open zhipeng93 opened 3 years ago

Is there an example for asynchronous data parallel training?
When you mention "asynchronous data parallel training", do you mean asynchronous parallelism between the Python thread and GPU computation, or asynchronous parallelism between the model diff AllReduce and the backward computation?
If it is the former, you can refer to this example of TrainNet().async_get(callback) in the ResNet50 model of the OneFlow-Benchmark repo:
def main():
    InitNodes(args)
    flow.env.log_dir(args.log_dir)

    snapshot = Snapshot(args.model_save_dir, args.model_load_dir)

    for epoch in range(args.num_epochs):
        metric = Metric(desc='train', calculate_batches=args.loss_print_every_n_iter,
                        batch_size=train_batch_size, loss_key='loss')
        for i in range(epoch_size):
            TrainNet().async_get(metric.metric_cb(epoch, i))

        if args.val_data_dir:
            metric = Metric(desc='validation', calculate_batches=num_val_steps,
                            batch_size=val_batch_size)
            for i in range(num_val_steps):
                InferenceNet().async_get(metric.metric_cb(epoch, i))

        snapshot.save('epoch_{}'.format(epoch))
If it is the latter, model diff AllReduce and backward computation in OneFlow are ALWAYS asynchronous and run in parallel.
I guess asynchronous data parallelism refers to ASP (asynchronous parallel) and SSP (stale synchronous parallel).
Thanks for the quick reply @chengtbf @yuanms2. I mean ASP and SSP (i.e., asynchronous parallelism between the model diff AllReduce and the backward computation). It is hard to imagine how to support ASP in OneFlow since it uses decentralized scheduling. Will you also include a parameter server node, or just use AllReduce for data parallel training?
For example, if we have 3 workers doing data parallel training, can we proceed to update the model once we have received only two of the three model diff updates? It would be great if there were an example of this.
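To make the question concrete, here is a framework-free toy sketch (hypothetical illustration only, not OneFlow code) of the ASP-style update I have in mind: a parameter server applies each worker's model diff as soon as it arrives, without waiting for the remaining workers.

import threading
import numpy as np

class ToyAsyncParamServer:
    """Applies each worker's gradient as soon as it arrives (no barrier)."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def push(self, grad):
        with self.lock:                 # update immediately; do not wait for other workers
            self.params -= self.lr * grad

    def pull(self):
        with self.lock:
            return self.params.copy()   # may be stale relative to other workers' pushes

def worker(server, steps):
    for _ in range(steps):
        w = server.pull()
        grad = 2.0 * (w - 1.0)          # toy gradient: minimize ||w - 1||^2
        server.push(grad)

server = ToyAsyncParamServer(dim=4)
threads = [threading.Thread(target=worker, args=(server, 100)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.pull())                    # converges close to 1.0 despite stale reads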