This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022).
Performance difference between token-gate and sentence-gate? #3
Open
GeneZC opened 2 years ago
How does performance differ between the token-gate and the sentence-gate? Also, what value of alpha is used for the load balancing loss?
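For context, here is a minimal sketch of what the question is contrasting. It is not taken from the MoEBERT code. It assumes a top-1 router and a Switch Transformer-style auxiliary loss, and the function names (`token_gate`, `sentence_gate`, `load_balance_loss`) are hypothetical. It is written in plain Python rather than PyTorch so that it runs without dependencies.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def token_gate(token_logits):
    """Assumed token-level routing: each token picks its own top-1 expert."""
    return [max(range(len(row)), key=row.__getitem__) for row in token_logits]


def sentence_gate(token_logits):
    """Assumed sentence-level routing: average the router logits over the
    tokens, then send every token in the sentence to the same top-1 expert."""
    n = len(token_logits)
    num_experts = len(token_logits[0])
    mean = [sum(row[e] for row in token_logits) / n for e in range(num_experts)]
    best = max(range(num_experts), key=mean.__getitem__)
    return [best] * n


def load_balance_loss(token_logits, assignments, alpha):
    """Switch-style auxiliary loss (an assumption, not confirmed for MoEBERT):
    alpha * num_experts * sum_e f_e * p_e, where f_e is the fraction of tokens
    routed to expert e and p_e is the mean router probability for expert e."""
    n = len(token_logits)
    num_experts = len(token_logits[0])
    probs = [softmax(row) for row in token_logits]
    f = [sum(1 for a in assignments if a == e) / n for e in range(num_experts)]
    p = [sum(pr[e] for pr in probs) / n for e in range(num_experts)]
    return alpha * num_experts * sum(fe * pe for fe, pe in zip(f, p))


# One sentence with three tokens and two experts (made-up router logits).
logits = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
per_token = token_gate(logits)        # experts chosen token by token
per_sentence = sentence_gate(logits)  # one expert shared by the whole sentence
aux = load_balance_loss(logits, per_token, alpha=0.01)  # alpha is illustrative
```

Under these assumptions, alpha only scales how strongly the auxiliary term pushes the router toward a uniform split across experts. The value used in the paper's experiments is exactly what this issue is asking about.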