What's new?
Op Changed
Split
A new `SplitOp` constructor with an `each_dims` option. It supports operations like the following (in the Python API):
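A minimal PyTorch sketch of what an `each_dims`-style split presumably does, assuming (from the option name) that each output's size along the split axis is listed explicitly; `torch.split` is used here only as an analogy, not this project's actual Python API:

```python
import torch

# Assumed semantics of an "each_dims" split: rather than cutting into equal
# chunks, the caller lists the size of every output along the split axis.
x = torch.randn(2, 10, 8)               # (batch, seq, hidden)

# Split the last axis into pieces of size 2, 3 and 3 (they must sum to 8).
parts = torch.split(x, [2, 3, 3], dim=-1)

for p in parts:
    print(p.shape)                       # (2, 10, 2), (2, 10, 3), (2, 10, 3)
```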
RMSNorm
A new RMSNorm constructor with an `add_unit_offset` flag. If the `add_unit_offset` flag is set, RMSNorm computes $output = output \times (1.f + weight)$. RMSNorm in Llama does not have an `add_unit_offset` operation; it only computes $output = output \times weight$.
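A minimal PyTorch sketch of the two behaviors, assuming the usual RMSNorm formulation; the `add_unit_offset` argument below mirrors the flag described above but is not this project's actual constructor signature:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """RMSNorm with an optional unit offset on the learned weight."""

    def __init__(self, dim: int, eps: float = 1e-6, add_unit_offset: bool = False):
        super().__init__()
        self.eps = eps
        self.add_unit_offset = add_unit_offset
        # Initialized so the layer starts as a pure RMS normalization in both modes.
        self.weight = nn.Parameter(torch.zeros(dim) if add_unit_offset
                                   else torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the last dimension.
        output = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        if self.add_unit_offset:
            # Gemma-style: output * (1 + weight)
            return output * (1.0 + self.weight)
        # Llama-style: output * weight
        return output * self.weight
```

Both modes start out as a plain RMS normalization: zero weights with the unit offset and unit weights without it give the same identity scaling at initialization.
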
The differences between Gemma and Llama

- Multiply Llama's input embeddings by $\sqrt{\text{hidden size}}$ -- Gemma calls it normalization and applies it to all inputs (be it from the vocab or passed directly); see the sketch after this list.
- Add 1 to the weights of LlamaRMSLayerNorm. Gemma's RMSNorm returns $output \times (1.f + weight)$; Llama doesn't add 1.
- The token embedding layer's weight is tied with `lm_head`.
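Taken together, the list above suggests how Gemma weights can be run through a Llama-style implementation. The sketch below is illustrative only: the state-dict key names (`model.embed_tokens.weight`, `lm_head.weight`, `*norm.weight`) follow the common Hugging Face Llama layout and are assumptions, not this project's naming.

```python
import math
import torch

def adapt_gemma_state_dict_for_llama(state_dict: dict) -> dict:
    """Illustrative conversion following the three differences above."""
    out = {}
    for name, tensor in state_dict.items():
        if name.endswith("norm.weight"):
            # Gemma applies output * (1 + weight); a Llama-style RMSNorm
            # applies output * weight, so fold the +1 into the stored weights.
            out[name] = tensor + 1.0
        else:
            out[name] = tensor

    # The LM head is tied to the token embedding weights.
    out["lm_head.weight"] = out["model.embed_tokens.weight"]
    return out

def embed_inputs(embed_tokens: torch.nn.Embedding,
                 input_ids: torch.Tensor,
                 hidden_size: int) -> torch.Tensor:
    # Gemma scales every input embedding by sqrt(hidden_size), whether it is
    # looked up from the vocabulary or passed in directly.
    return embed_tokens(input_ids) * math.sqrt(hidden_size)
```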