1-expert worse than dense model

databricks / megablocks

Apache License 2.0

1-expert worse than dense model #107

Open Muennighoff opened 1 month ago

Muennighoff commented 1 month ago

I'm finding that a 1-expert dMoE (brown curve) reaches a worse training loss than an otherwise equivalent dense model (green curve). Is there a reason to expect this difference, or should the two match? Thanks!