Open YJHMITWEB opened 8 months ago
Hi @YJHMITWEB, thanks for reaching out. You are correct in your observation that TRT-LLM only uses an allreduce (see here).
The allreduce step is a convenience so that the data flow is the same as TP. Instead of broadcasting each token to the other nodes and then doing the scale/bias steps (which would use an all-to-all), we do the rescaling locally on each node and then allreduce the results, using zero tensors for tokens the node does not own.
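To make the pattern concrete, here is a minimal NumPy sketch of that idea (not TRT-LLM code; top-1 routing and the names `assignment`, `partials`, etc. are illustrative assumptions): each rank computes outputs only for tokens routed to its local experts, leaves zeros elsewhere, and summing the per-rank tensors (the "allreduce") recovers the same result an all-to-all dispatch would produce.

```python
import numpy as np

# Illustrative sketch, not TRT-LLM code: emulate expert parallelism
# with an allreduce instead of an all-to-all. Top-1 routing for simplicity.
rng = np.random.default_rng(0)

num_tokens, hidden, num_experts, num_ranks = 8, 4, 4, 2
experts_per_rank = num_experts // num_ranks

x = rng.standard_normal((num_tokens, hidden))
assignment = rng.integers(0, num_experts, size=num_tokens)  # router output
W = rng.standard_normal((num_experts, hidden, hidden))      # one weight per expert

partials = []
for rank in range(num_ranks):
    local = np.zeros_like(x)  # zero tensors for tokens this rank doesn't own
    lo, hi = rank * experts_per_rank, (rank + 1) * experts_per_rank
    for e in range(lo, hi):
        mask = assignment == e
        local[mask] = x[mask] @ W[e]  # compute/rescale locally
    partials.append(local)

# "Allreduce": summing per-rank partials gives the full output, because
# every token is nonzero on exactly one rank.
out_allreduce = sum(partials)

# Reference: route each token to its assigned expert directly.
out_ref = np.stack([x[i] @ W[assignment[i]] for i in range(num_tokens)])
assert np.allclose(out_allreduce, out_ref)
```

The key point is that the summation is exact, not approximate: zeros from non-owning ranks contribute nothing, so the allreduce output matches the all-to-all output token for token, at the cost of reducing full-size tensors.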
There may be cases where the all-to-all pattern is better and we will continue actively investigating this option.
In general, though, we recommend Tensor Parallelism because of the load-balancing issues that are inherent in Expert Parallelism.
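As a toy illustration of that load-balancing point (the routing distribution below is made up, not measured from Mixtral): under expert parallelism, a skewed router makes the rank holding the popular expert the bottleneck, whereas tensor parallelism splits every token's work evenly across ranks.

```python
import numpy as np

# Hypothetical skewed routing distribution (illustrative, not measured).
rng = np.random.default_rng(1)
num_tokens, num_experts = 1024, 8
probs = np.array([0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05])

assignment = rng.choice(num_experts, size=num_tokens, p=probs)
counts = np.bincount(assignment, minlength=num_experts)

# Imbalance factor: how much slower the busiest expert's rank is
# relative to a perfectly balanced split. 1.0 would mean no penalty.
imbalance = counts.max() / counts.mean()
print(counts, imbalance)
```

With tensor parallelism the equivalent factor is 1.0 by construction, which is why it is the safer default even though it moves more data per layer.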
Hi, I am running and profiling the Mixtral implementation, but I could not find any AlltoAll operations, either in the code or in the profile.
I built the TRT engine using the following config:
I tried both `--moe_tp_mode 1` and `--moe_tp_mode 2`, but they seem to end up with the same tensor parallelism, with no expert parallelism enabled. The Nsight profile also shows only Allreduce and Allgather calls, which seems insufficient for expert parallelism.