alibaba / graph-learn

An Industrial Graph Neural Network Framework
Apache License 2.0

Synchronous Training #3

Closed power1628 closed 3 years ago

power1628 commented 4 years ago

Great work! I wonder how graph-learn does synchronous training. It would be great if there were a distributed synchronous training example.

baoleai commented 4 years ago

We use GraphSAGE as an example to show how to train in parallel on multiple machines. Currently we only provide an asynchronous training example; you can switch to synchronous training by using a synchronous optimizer in TensorFlow.

archwalker commented 4 years ago

Initializing the optimizer with the use_locking=True flag in DistTFTrainer will perform synchronous training in distributed settings. Modify the code here.

power1628 commented 4 years ago

> Initializing the optimizer with the use_locking=True flag in DistTFTrainer will perform synchronous training in distributed settings. Modify the code here.

Are you sure? use_locking only guarantees that concurrent updates to a variable are applied safely; it does not make distributed training synchronous. BTW, I don't think this issue should be closed.
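
For illustration, a toy sketch (not graph-learn code) of what the flag actually controls; it only guards individual variable updates against concurrent writers:

```python
import tensorflow as tf

# use_locking just takes a per-variable lock while an update is applied,
# so concurrent updates to the same variable don't interleave unsafely.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.01,
                                        use_locking=True)

# Nothing here waits for gradients from other workers, so distributed
# training with this optimizer remains asynchronous.
```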

archwalker commented 4 years ago

> Initializing the optimizer with the use_locking=True flag in DistTFTrainer will perform synchronous training in distributed settings. Modify the code here.

> Are you sure? use_locking only guarantees that concurrent updates to a variable are applied safely; it does not make distributed training synchronous. BTW, I don't think this issue should be closed.

I misunderstood you. Basically, to enable distributed synchronous training, one needs to:

  1. Wrap the optimizer with tf.train.SyncReplicasOptimizer here
  2. Add a synchronous hook in MonitoredTrainingSession here

A synchronous example will be posted soon.
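
In the meantime, here is a minimal sketch of the two steps, assuming TF 1.x; `loss`, `num_workers`, `is_chief`, and `server` stand in for the values DistTFTrainer would supply:

```python
import tensorflow as tf

# Step 1: wrap the base optimizer so gradients from all workers are
# aggregated before a single update is applied.
base_opt = tf.train.AdamOptimizer(learning_rate=0.001)
optimizer = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=num_workers,
    total_num_replicas=num_workers)
train_op = optimizer.minimize(
    loss, global_step=tf.train.get_or_create_global_step())

# Step 2: the hook lets the chief initialize the aggregation queues and
# makes every worker block until the aggregated update is ready.
sync_hook = optimizer.make_session_run_hook(is_chief=is_chief)

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=is_chief,
                                       hooks=[sync_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```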

power1628 commented 4 years ago

This issue should be re-opened since it's not fixed.

baoleai commented 3 years ago

The question is how to synchronize training using TensorFlow, and the discussion above has already given advice on how to do so.