Open jordgedu opened 6 months ago
Ah yes, a while back we were specifying the capacity factor in terms of tokens rather than multiples of the expected number of tokens per expert. We must have missed updating this when we changed it :)
Would you mind updating the other moe scripts as well? Thanks!
Also, out of curiosity - why are you using MoE, as opposed to dMoE?
When I tested it, I found that this abnormal value resulted in a huge amount of GPU memory being used.
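For anyone else hitting this, here is a rough sketch of why a stale value would blow up memory. It assumes the common formulation where per-expert capacity is the capacity factor times the expected number of tokens per expert; the function name, the token and expert counts, and the hidden size are all made up for illustration and are not taken from the library or the scripts.

```python
# Hypothetical illustration: capacity factor as a multiple of the
# expected tokens per expert (an assumed, commonly used formulation).

def expert_capacity(num_tokens: int, num_experts: int, top_k: int,
                    capacity_factor: float) -> int:
    """Tokens each expert has room for under the assumed formula."""
    expected_tokens_per_expert = num_tokens * top_k / num_experts
    return int(capacity_factor * expected_tokens_per_expert)


def buffer_elements(num_tokens: int, num_experts: int, top_k: int,
                    capacity_factor: float, hidden_size: int) -> int:
    """Elements in a padded (experts x capacity x hidden) buffer."""
    capacity = expert_capacity(num_tokens, num_experts, top_k, capacity_factor)
    return num_experts * capacity * hidden_size


# Made-up sizes for illustration only.
tokens, experts, hidden = 8192, 64, 1024

# A factor of 1.0 gives each expert exactly its expected share.
sane = buffer_elements(tokens, experts, 1, 1.0, hidden)

# A value of 64 that was meant as "64 tokens" is instead read as
# "64x the expected share", so every expert gets room for all tokens.
stale = buffer_elements(tokens, experts, 1, 64.0, hidden)

print(expert_capacity(tokens, experts, 1, 1.0))
print(expert_capacity(tokens, experts, 1, 64.0))
print(stale // sane)
```

Under these assumptions the padded buffer grows linearly with the capacity factor, so a value left over from the tokens-based convention inflates the allocation by that same factor.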