HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU training
BSD 2-Clause "Simplified" License

MoE + Weight Sharing #6

Open ClashLuke opened 2 years ago

ClashLuke commented 2 years ago

As proposed in WideNet, we could combine an MoE architecture with weight sharing. Incorporating a WideNet-style architecture should increase performance, decrease training time, and reduce the number of parameters needed. This issue is about implementing such a weight-sharing scheme and benchmarking its performance.
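For reference, the core idea could be sketched as follows: a single set of MoE expert weights is reused at every depth, while each recurrence gets its own router (WideNet similarly shares block weights across layers but keeps per-layer components independent). This is a minimal illustrative sketch, not code from this repo; all names (`moe_ffn`, `widenet_stack`, etc.) and the top-1 routing choice are assumptions.

```python
import jax
import jax.numpy as jnp


def moe_ffn(x, experts_w1, experts_w2, router_w):
    """Top-1 MoE feed-forward: each token is processed by one expert.

    x: [tokens, dim], experts_w1: [n_experts, dim, hidden],
    experts_w2: [n_experts, hidden, dim], router_w: [dim, n_experts].
    """
    logits = x @ router_w                     # [tokens, n_experts]
    choice = jnp.argmax(logits, axis=-1)      # hard top-1 expert per token
    w1 = experts_w1[choice]                   # [tokens, dim, hidden]
    w2 = experts_w2[choice]                   # [tokens, hidden, dim]
    h = jax.nn.relu(jnp.einsum('td,tdh->th', x, w1))
    return jnp.einsum('th,thd->td', h, w2)


def widenet_stack(x, shared_experts, routers):
    """Apply the SAME expert weights at every depth; only routers differ."""
    experts_w1, experts_w2 = shared_experts
    for router_w in routers:                  # one router per (shared) layer
        x = x + moe_ffn(x, experts_w1, experts_w2, router_w)  # residual
    return x
```

The parameter count is then dominated by one expert set rather than `depth` separate FFN stacks, which is where the parameter reduction comes from; the per-layer routers are comparatively tiny.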