Rewrite initialization

AkshitaB commented 4 weeks ago

Simplifies our inscrutable initialization

IMPORTANT: currently, the implementation matches the old buggy values for init in several places. See below.
Removes init_weights with its complex if-else logic.
Adds init_normal which only takes the module, the std, and optionally a cutoff_factor.
Computation of std and cutoff_factor is now handled in each module's reset_parameters().
Adds unit tests for initialization.
Removes the implementations of the kaiming_normal and fan_in InitFnType values, as these aren't being used anywhere. They can be added back later if needed.
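
For readers skimming the thread, here is a rough sketch of what a helper with the shape described above could look like. It is illustrative only, not the code in this PR: the signature follows the description (module, std, optional cutoff_factor), while the use of torch.nn.init.normal_ / trunc_normal_, the symmetric bound of cutoff_factor * std, and the bias zeroing are all assumptions.

```python
from typing import Optional

import torch.nn as nn


def init_normal(module: nn.Module, std: float, cutoff_factor: Optional[float] = None) -> None:
    """Draw module.weight from N(0, std^2), optionally truncated at +/- cutoff_factor * std."""
    if cutoff_factor is None:
        # Plain normal initialization, no truncation.
        nn.init.normal_(module.weight, mean=0.0, std=std)
    else:
        # Truncated normal: samples are kept within [-bound, bound].
        bound = cutoff_factor * std
        nn.init.trunc_normal_(module.weight, mean=0.0, std=std, a=-bound, b=bound)

    # Assumption of this sketch: biases of linear layers are zeroed.
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)


# Usage: the caller (e.g. a module's reset_parameters()) decides std and cutoff_factor.
layer = nn.Linear(64, 64)
init_normal(layer, std=0.02, cutoff_factor=3.0)
```

Keeping the helper this small is the point of the refactor as described: any per-module scaling of std happens at the call site rather than inside a shared if-else.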

Potential bugs found in initialization as a result of the refactoring (these will be fixed after feedback):

[x] OLMoBlock.ff_out's normal initialization multiplies std by an extra factor of 1 / math.sqrt(2 * self.config.n_layers). This potentially came from trying to incorporate full_megatron into the same function.
[x] Hardcoded values: mitchell hardcodes a cutoff_factor of 3.0 (always a truncated normal with 3.0). full_megatron hardcodes a default cutoff_factor of 3.0 (a truncated normal with config.init_cutoff_factor or 3.0). Again, this may be a result of trying to incorporate multiple inits into the same function. Ideally, the cutoff_factor should always come from the configurable config.init_cutoff_factor; do we always want to set this value to 3.0 for mitchell and megatron?
[ ] Need clarification: Why do we scale the embedding with the following factor if scale_logits=True? emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0
[ ] Additionally, in the case of mitchell init, because the factor was supplied in multiple places in the old code, std always ends up being 0.5 when scale_logits=True!
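
To make the arithmetic behind the ff_out and mitchell items concrete, here is a small sketch. The function names are hypothetical, init_std=0.02 and n_layers=32 are illustrative values only, and the mitchell base std of 1 / sqrt(d_model) is an assumption about the old code rather than something stated in this thread.

```python
import math


def ff_out_std_old(init_std: float, n_layers: int) -> float:
    """Old ff_out behavior as described above: std times an extra 1 / sqrt(2 * n_layers)."""
    return init_std * (1.0 / math.sqrt(2 * n_layers))


def mitchell_emb_std(d_model: int, scale_logits: bool) -> float:
    """Embedding std under mitchell init, ASSUMING a base std of 1 / sqrt(d_model)."""
    emb_std_factor = (0.5 * math.sqrt(d_model)) if scale_logits else 1.0
    # The sqrt(d_model) terms cancel when scale_logits is True, leaving 0.5.
    return emb_std_factor / math.sqrt(d_model)


# With 32 layers the extra factor is 1 / sqrt(64) = 1 / 8.
print(ff_out_std_old(init_std=0.02, n_layers=32))

# Under the stated assumption, the scaled std does not depend on d_model.
for d_model in (1024, 4096):
    print(d_model, mitchell_emb_std(d_model, True), mitchell_emb_std(d_model, False))
```

If that assumption holds, it would explain why std is always 0.5 with scale_logits=True: the model width drops out of the product entirely.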

AkshitaB commented 4 weeks ago

> No major concerns. I'm glad we're cleaning this up.
>
> Why do we scale the embedding with the following factor if scale_logits=True? emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0.
>
> This was another "trick" we heard works from someone else (not sure who).

Wouldn't this make more sense if we did this when weight_tying was on? I'm trying to get a sense of intuition for some of these choices/tricks.

epwalsh commented 4 weeks ago

Yea I'm guessing that's the only scenario where we tried it? It might have come from PaLM.

allenai / OLMo

Rewrite initialization #607