yucc-leon closed this issue 10 months ago.
The llama-7B and GPT-7B mentioned in our technical report are toy models trained on only 200B tokens. This preliminary experiment was intended only to verify the superiority of the LLaMA architecture (RoPE + RMSNorm + SwiGLU).
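For anyone unfamiliar with the three components, here is a minimal PyTorch sketch of what they look like; this is illustrative only (hypothetical dimensions, not this repo's actual implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of activations (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x W1) * (x W3), projected back with W2."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """RoPE: rotate channel pairs of x (shape ..., seq, dim) by position-dependent angles."""
    seq, dim = x.shape[-2], x.shape[-1]
    freqs = 1.0 / base ** (torch.arange(0, dim, 2, dtype=x.dtype) / dim)
    angles = torch.arange(seq, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```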
In Chapter 3 of your tech report you compared llama-7B against (your) GPT-7B, but you ultimately released a 13B model. So there are two questions: