HPDL-Group / Merak

Apache License 2.0

Understanding Data Propagation and Communication in Model Parallelism #9

Closed Hongjie1Chu closed 9 months ago

Hongjie1Chu commented 9 months ago

In the context of model parallelism, each layer's input is the output of the previous layer. Could someone explain how that output is passed from one layer to the next during training? Specifically, how do the processes in a model-parallel group communicate once each finishes its part of the computation, and which interface or function is responsible for transferring a layer's results to the subsequent layer?
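To make the question concrete, here is my current mental model as a single-process sketch. Plain Python queues stand in for the point-to-point channels; in a real pipeline-parallel run I assume something like `torch.distributed.send`/`recv` plays this role between ranks, but I am not sure which interface Merak actually uses:

```python
# Single-process sketch of pipeline-parallel data flow.
# Queues stand in for the p2p channels between stage processes;
# a real framework would use cross-process sends/recvs instead.
import queue

NUM_STAGES = 3
# channels[s] is the inbound channel of stage s
channels = {s: queue.Queue() for s in range(1, NUM_STAGES)}

def stage_forward(stage, x):
    # each "layer" just adds its stage id, standing in for a real forward pass
    return x + stage

def run_pipeline(x0):
    # stage 0 computes and "sends" its activation to stage 1
    channels[1].put(stage_forward(0, x0))
    out = None
    for s in range(1, NUM_STAGES):
        act = channels[s].get()        # "recv" from the previous stage
        out = stage_forward(s, act)
        if s + 1 < NUM_STAGES:
            channels[s + 1].put(out)   # "send" to the next stage
    return out

print(run_pipeline(10))  # 10 + 0 + 1 + 2 = 13
```

Is this roughly the pattern, with each stage blocking on a receive from its predecessor before running its forward pass?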

Referring to the model parallelism diagram below: how do processes 3 and 4 obtain the computation results from processes 1 and 2, respectively? After 3 and 4 complete their individual computations, how do they communicate with each other? And after that communication, which interface passes the results on to processes 5 and 6?
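For the communication between 3 and 4, my guess is a Megatron-style tensor-parallel step: each rank holds a shard of the weight, computes a partial result, and an all-reduce (sum) combines them before the result moves to the next stage. A simulated sketch of that guess (the sharding layout and the `all_reduce` role are my assumptions, not something I found in the code):

```python
# Sketch of a row-parallel linear layer split across 2 "ranks",
# simulated in one process. Each rank multiplies its weight shard
# by its input shard; summing the partials plays the role of
# an all_reduce(SUM) across the tensor-parallel group.
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# full 2x4 weight and input
W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 1, 1]

# split along the input dimension: rank 0 gets columns 0-1, rank 1 gets 2-3
W_shards = ([[1, 2], [5, 6]], [[3, 4], [7, 8]])
x_shards = ([1, 1], [1, 1])

partials = [matvec(Ws, xs) for Ws, xs in zip(W_shards, x_shards)]
y = [sum(vals) for vals in zip(*partials)]  # "all_reduce" over ranks

print(y)              # [10, 26]
print(matvec(W, x))   # [10, 26] -- matches the unsharded computation
```

If that is right, then after the all-reduce both ranks hold the full activation, and each presumably sends it point-to-point to its counterpart in the next stage (5 and 6). Is that how Merak does it?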

*(model parallelism diagram: `v2-708c01105de92567824bd9d3456b9459_720w.png` — image upload did not complete)*

I appreciate any clarification or references to relevant documentation that could help me understand these processes better.