OpenBMB / MiniCPM

MiniCPM3-4B: An edge-side LLM that surpasses GPT-3.5-Turbo.
Apache License 2.0

[Feature Request]: Details on the proxy model #117

Open yzlnew opened 5 months ago

yzlnew commented 5 months ago

Feature request

Thanks for your great work. Can you provide more details on the proxy model? I have several questions.

  1. The depth scaling on the attention/MLP sub-blocks is not consistent with Tensor Programs VI. Is this intentional? Also, in the original paper the learning rate should be $\eta/\sqrt{\text{depth}}$. https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16/blob/79fbb1db171e6d8bf77cdb0a94076a43003abd9e/modeling_minicpm.py#L818
  2. Are the embedding/unembedding layers initialized with the same scale as the other linear layers in the proxy model? It appears so in the modeling file, but both the embeddings and unembeddings should be initialized with a std that stays constant across width. In other words, in the blog,

    every two-dimensional tensor ← does this include the Linear layer inside the unembeddings?
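For context on point 1, the two scalings in question can be sketched as below. This is a minimal sketch, not MiniCPM's actual code: the function names and the illustrative values of `scale_depth` and `num_layers` are assumptions; the real values live in the model config and the linked `modeling_minicpm.py`.

```python
import math

def residual_scale(scale_depth: float, num_layers: int) -> float:
    """Per-sub-block residual multiplier, in the style of the linked
    modeling file: residual + sub_block_out * scale_depth / sqrt(num_layers)."""
    return scale_depth / math.sqrt(num_layers)

def depth_scaled_lr(base_lr: float, num_layers: int) -> float:
    """Learning rate scaled as eta / sqrt(depth), as the question says
    Tensor Programs VI prescribes."""
    return base_lr / math.sqrt(num_layers)

# Illustrative values only (assumptions, not MiniCPM's config):
print(residual_scale(1.4, 40))   # multiplier applied to each sub-block output
print(depth_scaled_lr(0.01, 40)) # depth-adjusted learning rate
```

The question is whether applying `residual_scale` per sub-block (attention and MLP each) matches the per-layer scaling derived in the paper, and whether the learning-rate division by $\sqrt{\text{depth}}$ is applied at all.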
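For context on point 2, the μP-style initialization the question appeals to can be sketched as follows. This is a toy illustration under the question's own assumption (hidden linear std shrinks with width, embedding/unembedding std does not); the function names and `base_std` are hypothetical, not fields of MiniCPM's config.

```python
import math

def hidden_linear_std(base_std: float, width: int) -> float:
    # Hidden linear layers: init std shrinks as 1/sqrt(width) under muP.
    return base_std / math.sqrt(width)

def embedding_std(base_std: float, width: int) -> float:
    # Embeddings/unembeddings: init std stays constant across width,
    # which is what the question argues the modeling file does not do.
    return base_std

# Illustrative: at width 4096 the hidden std is 64x smaller than base_std,
# while the embedding std is unchanged.
print(hidden_linear_std(0.1, 4096))
print(embedding_std(0.1, 4096))
```

If the proxy model initializes embeddings with `hidden_linear_std` rather than `embedding_std`, the transfer of hyperparameters across width would not follow the scheme described in the blog.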