YeonwooSung / Pytorch_mixture-of-experts

PyTorch implementation of MoE (Mixture of Experts)

Do training and inference of MoE share the same dispatching method? #2

Open marsggbo opened 10 months ago

marsggbo commented 10 months ago

While MoE training typically uses a fixed capacity to distribute tokens evenly across all experts, my understanding is that inference activates experts based on predicted relevance via a softmax gate. However, your implementation does not seem to differentiate between training and inference.
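
For reference, a minimal sketch (not this repo's code) of the two dispatch behaviours being contrasted: top-k selection from a softmax gate, with an optional fixed per-expert capacity that drops overflow tokens. The function name `route` and its signature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def route(logits, k=2, capacity=None):
    """Pick the top-k experts per token from softmax gate scores.

    logits: (num_tokens, num_experts) raw gate scores.
    If `capacity` is given, each expert keeps at most `capacity` assignments
    (in token order); later assignments are marked as dropped in the mask.
    """
    probs = F.softmax(logits, dim=-1)
    gate_w, expert_idx = probs.topk(k, dim=-1)            # (num_tokens, k)

    keep = torch.ones_like(expert_idx, dtype=torch.bool)
    if capacity is not None:
        # Running count of how many assignments each expert has received so far.
        flat_idx = expert_idx.flatten()
        one_hot = F.one_hot(flat_idx, num_classes=logits.size(-1))
        position_in_expert = (one_hot.cumsum(dim=0) * one_hot).sum(dim=-1)
        keep = (position_in_expert <= capacity).view_as(expert_idx)

    return expert_idx, gate_w, keep

# Example: 8 tokens, 4 experts, at most 3 assignments per expert.
idx, w, keep = route(torch.randn(8, 4), k=2, capacity=3)
```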

YeonwooSung commented 10 months ago

As far as I understand, what you described is softmax gating. The main contribution of the MoE paper is the Top-K gating network, which adds tunable Gaussian noise at training time (not at inference time).
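
For illustration, a rough sketch of that noisy Top-K gating, following the Shazeer et al. formulation (the class and attribute names here are assumptions, not this repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Top-K gating with tunable Gaussian noise applied only in training mode."""

    def __init__(self, d_model, num_experts, k=2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # clean gate logits
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # learned noise scale
        self.k = k

    def forward(self, x):
        clean_logits = self.w_gate(x)
        if self.training:
            # Tunable Gaussian noise: the std is a learned, per-expert function of the input.
            noise_std = F.softplus(self.w_noise(x))
            logits = clean_logits + torch.randn_like(clean_logits) * noise_std
        else:
            # Inference: no noise, so routing is deterministic given the input.
            logits = clean_logits

        # Keep only the top-k logits, mask the rest, then renormalise with softmax.
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float('-inf')).scatter(-1, topk_idx, topk_vals)
        gates = F.softmax(masked, dim=-1)            # sparse: zero outside the top-k
        return gates, topk_idx
```

Calling `model.eval()` switches `self.training` off, so the same gate routes deterministically at inference while using noisy routing during training.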

Top-K gating adds sparsity to the network, which allows it to run in a more hardware-efficient way.

The main drawback of Top-K gating is that it can suffer from an imbalanced load distribution across experts. To overcome this, many improved variants have been proposed recently (e.g. ST-MoE).
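
As a concrete example of one common mitigation in that line of work, here is a sketch of a Switch/ST-MoE-style auxiliary load-balancing loss (the function name and arguments are assumptions for illustration):

```python
import torch

def load_balancing_loss(gates, expert_idx, num_experts):
    """Auxiliary loss that penalises imbalanced expert load.

    gates:      (num_tokens, num_experts) dense gate probabilities.
    expert_idx: (num_tokens, k) indices of the experts each token was routed to.
    """
    # Fraction of routing assignments that went to each expert.
    counts = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
    frac_tokens = counts / counts.sum()
    # Average gate probability assigned to each expert.
    frac_prob = gates.mean(dim=0)
    # Minimised when both distributions are uniform across experts.
    return num_experts * torch.sum(frac_tokens * frac_prob)
```

This term is smallest when both the assignment counts and the average gate probabilities are uniform across experts, which discourages the gate from collapsing onto a few experts.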