Open marsggbo opened 10 months ago
As far as I understand, what you described is plain softmax gating, whereas the main contribution of the MoE paper is the noisy Top-K gating network, which adds tunable Gaussian noise at training time (not at inference time).
Top-K gating adds sparsity to the network, which lets it run in a more hardware-efficient way.
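For concreteness, here is a minimal sketch of what such a noisy Top-K gate could look like in PyTorch, following the Shazeer et al. (2017) formulation; the class and parameter names are illustrative, not taken from this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Noisy Top-K gate in the style of Shazeer et al. (2017).

    Gaussian noise is injected only while self.training is True, so
    inference uses the clean gate logits. Names/shapes are illustrative.
    """

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -> logits: (num_tokens, num_experts)
        logits = self.w_gate(x)
        if self.training:
            # Tunable noise: its scale is learned via w_noise (softplus keeps it positive).
            noise_std = F.softplus(self.w_noise(x))
            logits = logits + torch.randn_like(logits) * noise_std

        # Keep only the top-k logits per token; mask the rest to -inf so they
        # get exactly zero weight after the softmax (this is the sparsity).
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float("-inf")).scatter(-1, topk_idx, topk_vals)
        return F.softmax(masked, dim=-1)  # (num_tokens, num_experts), k nonzeros per row
```

With k much smaller than the number of experts, only k expert FFNs need to run per token, which is where the hardware efficiency comes from.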
And the main drawback of Top-K gating is that it can suffer from an imbalanced token distribution across experts. To address this, there are many improved versions these days (e.g. ST-MoE, etc.)
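A common ingredient in those improved versions is an auxiliary load-balancing loss on the gate. A hedged sketch of that idea (the helper below is hypothetical, not part of this repo):

```python
import torch

def load_balancing_loss(gate_probs: torch.Tensor, expert_index: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Auxiliary load-balancing loss in the Switch/ST-MoE style (illustrative).

    gate_probs:   (num_tokens, num_experts) softmax gate probabilities.
    expert_index: (num_tokens,) expert each token was actually routed to.
    """
    # f_i: fraction of tokens dispatched to expert i.
    dispatch_fraction = torch.bincount(expert_index, minlength=num_experts).float()
    dispatch_fraction = dispatch_fraction / expert_index.numel()
    # P_i: mean gate probability assigned to expert i.
    mean_gate_prob = gate_probs.mean(dim=0)
    # Minimized when both quantities are uniform across experts.
    return num_experts * torch.sum(dispatch_fraction * mean_gate_prob)
```

This term is added to the task loss with a small coefficient, pushing the gate toward using all experts roughly equally.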
While MoE training typically uses a fixed capacity to distribute tokens evenly across all experts, my understanding is that inference involves activating experts based on predicted relevance via a softmax gate. However, your implementation seems to lack this differentiation between training and inference.
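To illustrate what I mean, here is a rough sketch of one way such a train/inference split could look; the function, its names, and the capacity rule are my assumptions for illustration, not code from this repo:

```python
import torch
import torch.nn.functional as F

def route_tokens(gate_probs: torch.Tensor, training: bool,
                 capacity_factor: float = 1.25):
    """Illustrative routing that differs between training and inference.

    gate_probs: (num_tokens, num_experts) softmax gate probabilities.
    Returns the chosen expert per token and a mask of tokens kept after
    the (training-only) per-expert capacity limit.
    """
    num_tokens, num_experts = gate_probs.shape
    expert_index = gate_probs.argmax(dim=-1)  # top-1 routing by predicted relevance
    keep = torch.ones(num_tokens, dtype=torch.bool, device=gate_probs.device)

    if training:
        # Fixed per-expert capacity so tokens are spread roughly evenly;
        # tokens that overflow an expert's slots are dropped (keep=False).
        capacity = int(capacity_factor * num_tokens / num_experts)
        one_hot = F.one_hot(expert_index, num_experts)
        # 1-indexed position of each token within its expert's queue.
        position_in_expert = (one_hot.cumsum(dim=0) * one_hot).sum(dim=-1)
        keep = position_in_expert <= capacity

    return expert_index, keep
```

At inference the capacity cap is skipped, so every token simply goes to whichever expert(s) the softmax gate scores highest.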