Open mqyqlx opened 8 months ago
This is very weird. Have you been able to form any tentative hypotheses about it?
Not yet. My guess is that the two standard deviations used in Pythia-6.9B were set empirically rather than calculated from a formula.
Hi, I found that the parameter init method reported for the Pythia-6.9B model is inconsistent with the standard deviations measured on the step0 checkpoint. Table 6 in the paper lists init-method as small-init and output-layer-init-method as wang-init, but the std values I get from the step0 model do not match either of them.
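For reference, here is a minimal sketch of the std values those two init methods would be expected to give. It assumes the usual definitions (small-init: `sqrt(2 / (5 * d))`, wang-init: `2 / (L * sqrt(d))`) and assumes a hidden size `d = 4096` and `L = 32` layers for Pythia-6.9B. The function names are made up for illustration and are not taken from the training code.

```python
import math


def small_init_std(hidden_size: int) -> float:
    """Std for small-init, assuming the formula sqrt(2 / (5 * d))."""
    return math.sqrt(2.0 / (5.0 * hidden_size))


def wang_init_std(hidden_size: int, num_layers: int) -> float:
    """Std for wang-init, assuming the formula 2 / (L * sqrt(d))."""
    return 2.0 / (num_layers * math.sqrt(hidden_size))


# Assumed Pythia-6.9B dimensions (not read from the checkpoint itself).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

expected_small = small_init_std(HIDDEN_SIZE)
expected_wang = wang_init_std(HIDDEN_SIZE, NUM_LAYERS)

print(f"small-init expected std: {expected_small:.6f}")
print(f"wang-init expected std:  {expected_wang:.6f}")
```

If the stds measured on the step0 weights are far from both numbers, that would support the guess that the values were not derived from these formulas.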
Inconsistent std values:
Could you provide the real init method? Thanks!
Config Table 6:
Here are the script to reproduce this and its results.
Results: