UbiquitousLearning / mllm

Fast Multimodal LLM on Mobile Devices
https://ubiquitouslearning.github.io/mllm_website

Single precision inference support for the gemma-2B model #75

Closed chenghuaWang closed 7 months ago

chenghuaWang commented 7 months ago

What's new?


Op changes

Split

A new SplitOp constructor with an each_dims option:

```cpp
Split(const std::vector<int> &each_dims, Chl split_dim, const std::string &name)
```

This supports operations like the following (in the Python API):

```python
qkv.split([q_size, kv_size, kv_size], dim=-1)
```
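
To make the new constructor concrete, here is a minimal plain-C++ sketch of the each_dims semantics: the last dimension of a fused QKV tensor (modeled here as a flat `std::vector<float>`) is partitioned into chunks of the given sizes. The helper `split_last_dim` is hypothetical and purely illustrative, not part of mllm.

```cpp
// Sketch of the each_dims split semantics: partition a fused QKV row of
// width q_size + 2 * kv_size into chunks whose widths are given by each_dims.
#include <cassert>
#include <iostream>
#include <numeric>
#include <vector>

std::vector<std::vector<float>> split_last_dim(const std::vector<float> &row,
                                               const std::vector<int> &each_dims) {
    // The chunk sizes must cover the dimension exactly.
    assert(std::accumulate(each_dims.begin(), each_dims.end(), 0) ==
           static_cast<int>(row.size()));
    std::vector<std::vector<float>> chunks;
    auto it = row.begin();
    for (int d : each_dims) {
        chunks.emplace_back(it, it + d);  // copy the next d elements
        it += d;
    }
    return chunks;
}

int main() {
    const int q_size = 4, kv_size = 2;            // toy head sizes
    std::vector<float> qkv(q_size + 2 * kv_size); // fused QKV projection output
    std::iota(qkv.begin(), qkv.end(), 0.f);

    // Mirrors qkv.split([q_size, kv_size, kv_size], dim=-1) in the Python API.
    auto parts = split_last_dim(qkv, {q_size, kv_size, kv_size});
    std::cout << "q has " << parts[0].size() << " elements, "
              << "k and v have " << parts[1].size() << " each\n";
}
```

Because Gemma-2B uses MQA, the fused QKV projection has unequal query and key/value widths, so an even split into equal parts is not enough; the each_dims form handles exactly this case.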

RMSNorm

A new RMSNorm constructor with an add_unit_offset flag:

```cpp
RMSNorm(int norm_size, float epsilon, bool add_unit_offset, std::string name)
```

If the add_unit_offset flag is set, RMSNorm computes $output = output \times (1 + weight)$. Llama's RMSNorm has no unit offset; it only computes $output = output \times weight$.
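
A minimal sketch of what the flag changes numerically, assuming the standard RMSNorm formulation (plain C++, illustrative only; `rms_norm` is a hypothetical helper, not mllm's kernel):

```cpp
// Sketch of RMSNorm with and without the unit offset: with add_unit_offset
// the normalized value is scaled by (1 + weight), as in Gemma; without it,
// by weight alone, as in Llama.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

std::vector<float> rms_norm(const std::vector<float> &x,
                            const std::vector<float> &weight,
                            float epsilon, bool add_unit_offset) {
    // Root mean square of the input.
    float mean_sq = 0.f;
    for (float v : x) mean_sq += v * v;
    mean_sq /= static_cast<float>(x.size());
    const float inv_rms = 1.f / std::sqrt(mean_sq + epsilon);

    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        // Gemma scales by (1 + weight); Llama scales by weight alone.
        const float scale = add_unit_offset ? (1.f + weight[i]) : weight[i];
        out[i] = x[i] * inv_rms * scale;
    }
    return out;
}

int main() {
    std::vector<float> x = {1.f, 2.f, 3.f, 4.f};
    std::vector<float> w(x.size(), 0.f);  // zero-initialized weights
    // With add_unit_offset and zero weights, the output is just x / rms(x).
    for (float v : rms_norm(x, w, 1e-6f, /*add_unit_offset=*/true))
        std::cout << v << ' ';
    std::cout << '\n';
}
```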

The differences between Gemma and Llama

  1. Gemma multiplies the input embeddings by $\sqrt{\text{hidden size}}$. Gemma calls this a normalization and applies it to all inputs, whether they come from the vocabulary embedding or are passed in directly (see the sketch after this list).
  2. Gemma adds 1 to the weight of Llama's RMSNorm: Gemma's RMSNorm returns $output \times (1 + weight)$, while Llama's does not add 1.
  3. The token embedding layer's weight is tied with lm_head.
  4. Gemma-2B uses multi-query attention (MQA) instead of multi-head attention (MHA).
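
As a rough illustration of difference 1 (plain C++, illustrative only; `scale_input_embedding` is a hypothetical helper, not mllm's code), the embedding row is scaled by $\sqrt{\text{hidden size}}$ before it enters the first decoder layer:

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Scale one embedding row in place by sqrt(hidden_size).
void scale_input_embedding(std::vector<float> &embedding) {
    const float normalizer = std::sqrt(static_cast<float>(embedding.size()));
    for (float &v : embedding) v *= normalizer;
}

int main() {
    std::vector<float> embedding(2048, 1.f);  // toy hidden_size = 2048
    scale_input_embedding(embedding);
    std::cout << embedding[0] << '\n';  // prints sqrt(2048), about 45.25
}
```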