OpenBMB / MiniCPM

MiniCPM3-4B: An edge-side LLM that surpasses GPT-3.5-Turbo.
Apache License 2.0

[Feature Request]: Details on the proxy model #117

Open yzlnew opened 5 months ago

yzlnew commented 5 months ago

Feature request

Thanks for your great work. Can you provide more details on the proxy model? I have several questions.

  1. The depth scaling on the attention/MLP sub-blocks is not consistent with Tensor Programs VI. Is this intentional? Also, in the original paper the learning rate should be $\eta/\sqrt{\text{depth}}$. https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16/blob/79fbb1db171e6d8bf77cdb0a94076a43003abd9e/modeling_minicpm.py#L818
  2. Are the embedding/unembedding layers initialized with the same scale as the other linear layers in the proxy model? It appears so in the modeling file, but both the embeddings and unembeddings should be initialized with a std that stays constant across width. In other words, in the blog,

    every two-dimensional tensor ← does this include the Linear layer inside the unembeddings?
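For context on point 1, the two scalings in question can be sketched as below. This is a minimal sketch, not MiniCPM's actual code: the function names and the illustrative values of `scale_depth` and `num_layers` are assumptions; the real values live in the model config and the linked `modeling_minicpm.py`.

```python
import math

def residual_scale(scale_depth: float, num_layers: int) -> float:
    """Per-sub-block residual multiplier, in the style of the linked
    modeling file: residual + sub_block_out * scale_depth / sqrt(num_layers)."""
    return scale_depth / math.sqrt(num_layers)

def depth_scaled_lr(base_lr: float, num_layers: int) -> float:
    """Learning rate scaled as eta / sqrt(depth), as the question says
    Tensor Programs VI prescribes."""
    return base_lr / math.sqrt(num_layers)

# Illustrative values only (assumptions, not MiniCPM's config):
print(residual_scale(1.4, 40))   # multiplier applied to each sub-block output
print(depth_scaled_lr(0.01, 40)) # depth-adjusted learning rate
```

The question is whether applying `residual_scale` per sub-block (attention and MLP each) matches the per-layer scaling derived in the paper, and whether the learning-rate division by $\sqrt{\text{depth}}$ is applied at all.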
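For context on point 2, the μP-style initialization the question appeals to can be sketched as follows. This is a toy illustration under the question's own assumption (hidden linear std shrinks with width, embedding/unembedding std does not); the function names and `base_std` are hypothetical, not fields of MiniCPM's config.

```python
import math

def hidden_linear_std(base_std: float, width: int) -> float:
    # Hidden linear layers: init std shrinks as 1/sqrt(width) under muP.
    return base_std / math.sqrt(width)

def embedding_std(base_std: float, width: int) -> float:
    # Embeddings/unembeddings: init std stays constant across width,
    # which is what the question argues the modeling file does not do.
    return base_std

# Illustrative: at width 4096 the hidden std is 64x smaller than base_std,
# while the embedding std is unchanged.
print(hidden_linear_std(0.1, 4096))
print(embedding_std(0.1, 4096))
```

If the proxy model initializes embeddings with `hidden_linear_std` rather than `embedding_std`, the transfer of hyperparameters across width would not follow the scheme described in the blog.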