UbiquitousLearning / mllm

Fast Multimodal LLM on Mobile Devices
https://ubiquitouslearning.github.io/mllm_website

Single precision inference support for the gemma-2B model #75

Closed chenghuaWang closed 7 months ago

chenghuaWang commented 7 months ago

What's new?


Op changes

Split

A new SplitOp constructor with an each_dims option:

```cpp
Split(const std::vector<int> &each_dims, Chl split_dim, const std::string &name)
```

This supports operations like the following (in the Python API):

```python
qkv.split([q_size, kv_size, kv_size], dim=-1)
```
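
To make the new constructor concrete, here is a minimal plain-C++ sketch of the each_dims semantics: the last dimension of a fused QKV tensor (modeled here as a flat `std::vector<float>`) is partitioned into chunks of the given sizes. The helper `split_last_dim` is hypothetical and purely illustrative, not part of mllm.

```cpp
// Sketch of the each_dims split semantics: partition a fused QKV row of
// width q_size + 2 * kv_size into chunks whose widths are given by each_dims.
#include <cassert>
#include <iostream>
#include <numeric>
#include <vector>

std::vector<std::vector<float>> split_last_dim(const std::vector<float> &row,
                                               const std::vector<int> &each_dims) {
    // The chunk sizes must cover the dimension exactly.
    assert(std::accumulate(each_dims.begin(), each_dims.end(), 0) ==
           static_cast<int>(row.size()));
    std::vector<std::vector<float>> chunks;
    auto it = row.begin();
    for (int d : each_dims) {
        chunks.emplace_back(it, it + d);  // copy the next d elements
        it += d;
    }
    return chunks;
}

int main() {
    const int q_size = 4, kv_size = 2;            // toy head sizes
    std::vector<float> qkv(q_size + 2 * kv_size); // fused QKV projection output
    std::iota(qkv.begin(), qkv.end(), 0.f);

    // Mirrors qkv.split([q_size, kv_size, kv_size], dim=-1) in the Python API.
    auto parts = split_last_dim(qkv, {q_size, kv_size, kv_size});
    std::cout << "q has " << parts[0].size() << " elements, "
              << "k and v have " << parts[1].size() << " each\n";
}
```

Because Gemma-2B uses MQA, the fused QKV projection has unequal query and key/value widths, so an even split into equal parts is not enough; the each_dims form handles exactly this case.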

RMSNorm

A new RMSNorm constructor with an add_unit_offset flag:

```cpp
RMSNorm(int norm_size, float epsilon, bool add_unit_offset, std::string name)
```

If the add_unit_offset flag is set, RMSNorm computes $output = output \times (1 + weight)$. Llama's RMSNorm has no unit offset; it only computes $output = output \times weight$.
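
A minimal sketch of what the flag changes numerically, assuming the standard RMSNorm formulation (plain C++, illustrative only; `rms_norm` is a hypothetical helper, not mllm's kernel):

```cpp
// Sketch of RMSNorm with and without the unit offset: with add_unit_offset
// the normalized value is scaled by (1 + weight), as in Gemma; without it,
// by weight alone, as in Llama.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

std::vector<float> rms_norm(const std::vector<float> &x,
                            const std::vector<float> &weight,
                            float epsilon, bool add_unit_offset) {
    // Root mean square of the input.
    float mean_sq = 0.f;
    for (float v : x) mean_sq += v * v;
    mean_sq /= static_cast<float>(x.size());
    const float inv_rms = 1.f / std::sqrt(mean_sq + epsilon);

    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        // Gemma scales by (1 + weight); Llama scales by weight alone.
        const float scale = add_unit_offset ? (1.f + weight[i]) : weight[i];
        out[i] = x[i] * inv_rms * scale;
    }
    return out;
}

int main() {
    std::vector<float> x = {1.f, 2.f, 3.f, 4.f};
    std::vector<float> w(x.size(), 0.f);  // zero-initialized weights
    // With add_unit_offset and zero weights, the output is just x / rms(x).
    for (float v : rms_norm(x, w, 1e-6f, /*add_unit_offset=*/true))
        std::cout << v << ' ';
    std::cout << '\n';
}
```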

The differences between Gemma and Llama

  1. Gemma multiplies the input embeddings by $\sqrt{\text{hidden size}}$. Gemma calls this a normalization and applies it to all inputs, whether they come from the vocabulary embedding or are passed in directly (see the sketch after this list).
  2. Gemma adds 1 to the weight of Llama's RMSNorm: Gemma's RMSNorm returns $output \times (1 + weight)$, while Llama's does not add 1.
  3. The token embedding layer's weight is tied with lm_head.
  4. Gemma-2B uses multi-query attention (MQA) instead of multi-head attention (MHA).
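
As a rough illustration of difference 1 (plain C++, illustrative only; `scale_input_embedding` is a hypothetical helper, not mllm's code), the embedding row is scaled by $\sqrt{\text{hidden size}}$ before it enters the first decoder layer:

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Scale one embedding row in place by sqrt(hidden_size).
void scale_input_embedding(std::vector<float> &embedding) {
    const float normalizer = std::sqrt(static_cast<float>(embedding.size()));
    for (float &v : embedding) v *= normalizer;
}

int main() {
    std::vector<float> embedding(2048, 1.f);  // toy hidden_size = 2048
    scale_input_embedding(embedding);
    std::cout << embedding[0] << '\n';  // prints sqrt(2048), about 45.25
}
```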