databricks / megablocks

Apache License 2.0
1.11k stars, 154 forks

1-expert worse than dense model #107

Status: Open. Opened by Muennighoff 1 month ago

Muennighoff commented 1 month ago

I'm finding that a 1-expert dMoE (brown) reaches a worse training loss than an otherwise equivalent dense model (green). Is there a reason to expect this difference, or should the two match? Thanks!

[Screenshot, 2024-05-08: training loss curves for the 1-expert dMoE (brown) and the dense model (green)]