Open zhipeng93 opened 3 years ago

Is there an example for asynchronous data parallel training?
When you mention "asynchronous data parallel training", do you mean asynchronous parallelism between the Python thread and GPU computation, or asynchronous parallelism between the model diff AllReduce and the backward computation?
If it is the former, you can refer to this example of TrainNet().async_get(callback) in the ResNet50 model of the OneFlow-Benchmark repo:
def main():
    InitNodes(args)
    flow.env.log_dir(args.log_dir)

    snapshot = Snapshot(args.model_save_dir, args.model_load_dir)

    for epoch in range(args.num_epochs):
        metric = Metric(desc='train', calculate_batches=args.loss_print_every_n_iter,
                        batch_size=train_batch_size, loss_key='loss')
        for i in range(epoch_size):
            TrainNet().async_get(metric.metric_cb(epoch, i))

        if args.val_data_dir:
            metric = Metric(desc='validation', calculate_batches=num_val_steps,
                            batch_size=val_batch_size)
            for i in range(num_val_steps):
                InferenceNet().async_get(metric.metric_cb(epoch, i))

        snapshot.save('epoch_{}'.format(epoch))
If it is the latter, model diff AllReduce and backward computation in OneFlow are ALWAYS asynchronous and run in parallel.
I guess asynchronous data parallelism refers to ASP (asynchronous parallel) and SSP (stale synchronous parallel).
Thanks for the quick reply @chengtbf @yuanms2. I mean ASP and SSP (i.e., asynchronous parallelism between the model diff AllReduce and the backward computation). It is hard to imagine how to support ASP in OneFlow since it uses decentralized scheduling. Will you also include a parameter server node, or just use AllReduce for data parallel training?
For example, if we have 3 workers doing data parallel training, can we proceed to update the model once we have received only two of the three model diff updates? It would be great if there were an example of this.
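To make the question concrete, here is a framework-free toy sketch (hypothetical illustration only, not OneFlow code) of the ASP-style update I have in mind: a parameter server applies each worker's model diff as soon as it arrives, without waiting for the remaining workers.

import threading
import numpy as np

class ToyAsyncParamServer:
    """Applies each worker's gradient as soon as it arrives (no barrier)."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def push(self, grad):
        with self.lock:                 # update immediately; do not wait for other workers
            self.params -= self.lr * grad

    def pull(self):
        with self.lock:
            return self.params.copy()   # may be stale relative to other workers' pushes

def worker(server, steps):
    for _ in range(steps):
        w = server.pull()
        grad = 2.0 * (w - 1.0)          # toy gradient: minimize ||w - 1||^2
        server.push(grad)

server = ToyAsyncParamServer(dim=4)
threads = [threading.Thread(target=worker, args=(server, 100)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.pull())                    # converges close to 1.0 despite stale reads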