HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU-Training
BSD 2-Clause "Simplified" License
45 stars · 5 forks

Weight-Tie MoE #88

Closed. ClashLuke closed this pull request 1 year ago

ClashLuke commented 1 year ago

This PR reduces the number of parameters from 9.6B to 1.6B while performing as well as the untied MoE for the first 2 billion tokens: [image]

Both the tied and the untied MoE are significantly better than the baseline: [image]
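For readers who want a feel for why weight tying shrinks the parameter count, here is a rough sketch of the arithmetic. It assumes one possible tying scheme, a single bank of experts shared across all layers, which may not be what this PR actually does. The function names and the layer sizes are hypothetical and are not meant to reproduce the 9.6B and 1.6B figures above.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights of one expert: up-projection plus down-projection, biases ignored."""
    return 2 * d_model * d_ff


def moe_params(d_model: int, d_ff: int, n_experts: int, n_layers: int,
               tie_across_layers: bool) -> int:
    """Parameter count of the MoE feed-forward blocks only (hypothetical layout)."""
    bank = n_experts * ffn_params(d_model, d_ff)   # one full set of experts
    routers = n_layers * d_model * n_experts       # assumed: routers stay per-layer
    n_banks = 1 if tie_across_layers else n_layers  # tying keeps a single shared bank
    return n_banks * bank + routers


# Hypothetical sizes, chosen only for illustration.
untied = moe_params(d_model=1024, d_ff=4096, n_experts=8, n_layers=12,
                    tie_across_layers=False)
tied = moe_params(d_model=1024, d_ff=4096, n_experts=8, n_layers=12,
                  tie_across_layers=True)
print(f"untied: {untied:,}  tied: {tied:,}  ratio: {untied / tied:.1f}x")
```

Under this assumed scheme the expert weights dominate the count, so the saving grows almost linearly with the number of layers that share a bank. Compute per token is unchanged, since each layer still runs the same experts.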

ClashLuke commented 1 year ago

After 10B tokens, the weight-tied model is better than the untied MoE.