justheuristic opened this issue 2 years ago · Open
Hi @justheuristic,
Thanks for your interest in our work! It is great to discuss this here.
As you may have seen, we limit our scope to the regime where the network condition is stable and can be estimated relatively accurately by some network profiling. This assumption may certainly be violated in reality. How to handle a dynamic decentralized environment with fault tolerance is a very interesting problem; in fact, it has the highest priority on our todo list. But to be honest, we do not yet know what the optimal design should be.
BTW, we are aware of the Swarm parallelism paper. In fact, we appreciate this paper and the other papers from the group on this topic!
Best wishes, Binhang
Hi @BinhangYuan, I'm sorry, I didn't mean to skew the discussion towards the dynamic environment.
Can you please elaborate on what happens in your setup from a system design perspective?
In a static (or slowly changing) hardware configuration, one can indeed measure the network properties ahead of time. But how would nodes perform that measurement in a decentralized setting? Would they elect a temporary "leader" that runs the profiler and the optimization, or would they follow some decentralized protocol?
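To be concrete, here is a minimal sketch of the "temporary leader" option I have in mind (entirely my own illustration, not something from your paper; `dht`, `solver`, and all method names are hypothetical placeholders):

```python
# Hypothetical sketch (not from the paper): peers publish their measurements to a
# shared key-value store, the peer with the lowest id acts as a temporary leader,
# runs the solver once, and publishes the resulting plan for everyone else.

def profile_and_schedule(dht, my_peer_id, my_measurements, solver):
    # 1. every peer publishes its local network measurements
    dht.store(f"measurements/{my_peer_id}", my_measurements)

    # 2. deterministic "election": the peer with the smallest id becomes the leader
    all_peer_ids = sorted(dht.get_peers())
    leader_id = all_peer_ids[0]

    if my_peer_id == leader_id:
        # 3. the leader gathers everyone's measurements and solves the placement problem
        measurements = {pid: dht.get(f"measurements/{pid}") for pid in all_peer_ids}
        schedule = solver(measurements)
        # 4. the leader publishes the plan so all peers follow the same schedule
        dht.store("schedule", schedule)
        return schedule

    # non-leaders simply wait for the leader's published plan
    return dht.get("schedule", wait=True)
```

Is this roughly the kind of protocol you envision, or something else entirely?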
First of all, thanks for the paper! It was very intriguing to view model parallelism as an optimization problem in itself.
I wonder how such scheduling would work in a fully decentralized system. Naively, one could run the solver concurrently on all nodes in the hope that they all find the same solution.
However, this naive option may be difficult to implement in geographically distributed networks: if nodes observe slightly different network bandwidths, or take their measurements at different times, they may end up with different solutions.
Is there a way to guarantee that the resulting schedule is consistent across the network? One can always elect a "leader" or let nodes vote on the solution, but perhaps there is a more natural way to approach this. What would you suggest?
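One leaderless alternative I can imagine (again, just my own sketch with made-up helper names, not a claim about your design): have all peers first agree on a single frozen measurement snapshot for a given round, then run the same deterministic solver locally, so consistency follows from determinism rather than from voting:

```python
# Toy sketch of a "no explicit leader" approach: agree on one snapshot per round,
# then every peer runs the same deterministic solver on identical inputs.
import hashlib
import json

def agree_on_snapshot(dht, round_id, my_peer_id, my_measurements):
    # every peer publishes its measurements under a fixed round id, then reads back
    # everyone's entries, so all peers see the same snapshot for this round
    dht.store(f"round/{round_id}/{my_peer_id}", my_measurements)
    peers = sorted(dht.get_peers())
    return {pid: dht.get(f"round/{round_id}/{pid}") for pid in peers}

def deterministic_schedule(snapshot, solver):
    # ties inside the solver must be broken deterministically (e.g. by peer id),
    # otherwise peers may still diverge even on identical inputs
    schedule = solver(snapshot)
    fingerprint = hashlib.sha256(json.dumps(schedule, sort_keys=True).encode()).hexdigest()
    return schedule, fingerprint  # peers can compare fingerprints to detect divergence
```

The obvious downside is that the snapshot agreement itself needs some synchronization, so maybe it just moves the problem around.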
P.S. Another group that I'm in close contact with faced a similar issue in their paper, and they ended up with a heuristic load-balancing rule where nodes greedily switch pipeline stages (a toy version is sketched below). However, unlike your work, they do not prove that such a rule leads to optimal throughput.
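For reference, a toy version of the kind of greedy rule I mean (my own simplification, not their actual algorithm): each worker periodically checks the aggregate throughput of every pipeline stage and moves to the current bottleneck stage if that would improve the pipeline as a whole.

```python
# Toy greedy stage-switching rule (my simplification, not the actual algorithm):
# a worker re-assigns itself to the slowest pipeline stage when that helps.

def maybe_switch_stage(my_stage, stage_throughput, my_throughput, hysteresis=1.2):
    """stage_throughput: dict {stage_index: summed throughput of that stage's workers}."""
    bottleneck = min(stage_throughput, key=stage_throughput.get)
    if bottleneck == my_stage:
        return my_stage  # already serving the slowest stage

    # pipeline throughput is limited by its slowest stage; estimate it before and
    # after hypothetically moving this worker to the bottleneck stage
    before = min(stage_throughput.values())
    after = min(
        (t - my_throughput if s == my_stage else
         t + my_throughput if s == bottleneck else t)
        for s, t in stage_throughput.items()
    )
    # switch only if the bottleneck improves by a noticeable margin (avoids oscillation)
    return bottleneck if after > before * hysteresis else my_stage
```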