Closed ClashLuke closed 1 year ago
This PR reduces the number of parameters from 9.6B to 1.6B and performs as well as untied MoE for the first 2 billion tokens:
With that, both are significantly better than the baseline:
Better than MoE after 10B tokens
- `is_stacked`: check for `_stacked` parameter-name suffix instead of dim size
- `pattern_match`: looped `lax.cond` to single `lax.switch` call
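The two code changes named in the description (`is_stacked` checking a name suffix, and `pattern_match` moving from looped `lax.cond` to one `lax.switch`) can be illustrated with a minimal JAX sketch. The PR's actual code is not shown here, so everything below is an assumption for illustration: the `is_stacked` signature, the `BRANCHES` tuple, and the helper names `select_with_cond_loop` and `select_with_switch` are hypothetical.

```python
import jax
import jax.numpy as jnp
from jax import lax


def is_stacked(name: str) -> bool:
    # Hypothetical signature: decide by parameter-name suffix
    # rather than by inspecting the size of a dimension.
    return name.endswith("_stacked")


# Hypothetical branch functions standing in for whatever pattern_match selects.
BRANCHES = (
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: -x,
)


def select_with_cond_loop(index, x):
    # Looped variant: one lax.cond per branch, each either applying its
    # branch or passing the previous result through unchanged.
    out = jnp.zeros_like(x)
    for i, branch in enumerate(BRANCHES):
        out = lax.cond(index == i, branch, lambda _x, prev=out: prev, x)
    return out


def select_with_switch(index, x):
    # Single-call variant: lax.switch picks the branch by integer index.
    return lax.switch(index, BRANCHES, x)


x = jnp.arange(3, dtype=jnp.float32)
idx = jnp.int32(1)
looped = jax.jit(select_with_cond_loop)(idx, x)
switched = jax.jit(select_with_switch)(idx, x)
```

Both helpers return the same value for an in-range index. The `lax.switch` form traces each branch once and emits a single conditional, instead of a chain of conditionals whose length grows with the number of branches.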