tremblerz opened 2 weeks ago
I don't know if people are still working on this, but I am working on a structure very similar to the gRPC protocol, only for MPI. There are some fixed values in comm.proto for the gRPC protocol. I would prefer to reuse this setup for MPI, but I don't want to fully commit to it if this isn't what is wanted.
Also, has this issue been resolved by @kathrynle20's PR? If so, I will not work on my own implementation.
In our MPI implementation, node A has to call `comm.send` and node B has to call `comm.recv` for a message to be successfully communicated from node A to node B. In contrast, gRPC only requires calling `receive` from the other node. gRPC does not require `send` because each node is running a gRPC server on a parallel thread. The gRPC approach is much more scalable because it does not require synchronization between sender and receiver. Therefore, we need to write an efficient MPI-based server which, similar to gRPC, runs on a parallel thread and serves models and tensors when requested by a receiving user.
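A minimal sketch of the pattern being proposed (all names here are hypothetical, not from the codebase): a server loop runs on a parallel thread on the owning node and answers tensor requests as they arrive, so the owner never has to call `send` explicitly. For a runnable example without an MPI runtime, Python queues stand in for the MPI request/reply channels; in the real implementation the serve loop would instead sit on `MPI.Iprobe`/`comm.recv` with a dedicated tag.

```python
import threading
import queue

class TensorServer:
    """Serves tensors from a parallel thread, gRPC-style.

    The queues are stand-ins for MPI messages: in an MPI version,
    `requests` would be incoming `comm.recv` messages on a request tag,
    and `reply.put(...)` would be a `comm.send` back to the requester.
    """

    def __init__(self, store):
        self.store = store                    # name -> tensor (plain lists here)
        self.requests = queue.Queue()         # stands in for the MPI inbox
        self._thread = threading.Thread(target=self._serve, daemon=True)

    def start(self):
        self._thread.start()

    def request(self, name):
        # Only the *receiving* side blocks, on its own reply channel;
        # the owner's serve loop handles the send asynchronously.
        reply = queue.Queue(maxsize=1)
        self.requests.put((name, reply))
        return reply.get(timeout=5)

    def _serve(self):
        # Server loop on the parallel thread: answer requests as they come,
        # so sender and receiver never need to synchronize explicitly.
        while True:
            name, reply = self.requests.get()
            reply.put(self.store.get(name))

server = TensorServer({"layer0.weight": [0.1, 0.2, 0.3]})
server.start()
print(server.request("layer0.weight"))
```

This mirrors why gRPC scales better here: the request/serve split removes the requirement that `comm.send` and `comm.recv` be called in matching pairs at matching times.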
If this turns out to be too much unnecessary effort, we should also consider dropping MPI entirely.