Closed AkshitaB closed 4 weeks ago
No major concerns. I'm glad we're cleaning this up.
Why do we scale the embedding with the following factor if scale_logits=True? emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0.
This was another "trick" we heard works from someone else (not sure who).
Wouldn't this make more sense if we did this when weight_tying was on? I'm trying to get a sense of intuition for some of these choices/tricks.
Yea I'm guessing that's the only scenario where we tried it? It might have come from PaLM.
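For intuition on the size of the factor being discussed, here is a minimal sketch that evaluates the quoted expression for a few model widths. The wrapper function and the sample d_model values are hypothetical, chosen only for illustration; the expression itself is the one quoted in the thread.

```python
import math


def emb_std_factor(d_model: int, scale_logits: bool) -> float:
    # Same expression as quoted above, wrapped in a hypothetical helper
    # so it can be evaluated without a config object.
    return (0.5 * math.sqrt(d_model)) if scale_logits else 1.0


if __name__ == "__main__":
    # Sample widths are illustrative, not taken from any real config.
    for d_model in (1024, 4096):
        print(d_model, emb_std_factor(d_model, scale_logits=True))
```

The factor grows with the square root of the width, so whatever base std it multiplies is scaled up more for wider models.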
Simplifies our inscrutable initialization:
- Removes init_weights with its complex if-else logic.
- Adds init_normal, which only takes the module, the std, and optionally a cutoff_factor.
- reset_parameters()
- Removes kaiming_normal and fan_in InitFnType, as these aren't being used anywhere. They can be added later if needed.

Potential bugs found in initialization as a result of the refactoring (these will be fixed after feedback):
- OLMoBlock.ff_out's normal initialization multiplies std by an extra factor of 1 / math.sqrt(2 * self.config.n_layers). This potentially came from trying to incorporate full_megatron into the same function.
- mitchell hardcodes a cutoff_factor of 3.0 (always truncated normal with 3.0).
- full_megatron hardcodes a default cutoff_factor of 3.0 (truncated normal with config.init_cutoff_factor or 3.0). Again, this may be a result of trying to incorporate multiple inits into the same function. Ideally, the cutoff_factor should always come from the configurable config.init_cutoff_factor; do we want to always set this value to 3.0 for mitchell and megatron?
- Why do we scale the embedding with the following factor if scale_logits=True? emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0
- In mitchell init, due to supplying the factor at multiple places in the old code, std ends up always being 0.5 when scale_logits=True!
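
The std arithmetic behind the items above can be sketched in plain Python. This is not the repository's code: all four function names are hypothetical, the truncated normal is approximated with simple rejection sampling instead of a tensor initializer, and the mitchell base std of 1 / sqrt(d_model) is an assumption that is consistent with the 0.5 result described above but is not stated in this thread.

```python
import math
import random


def resolve_cutoff(init_cutoff_factor):
    # Mirrors the quoted `config.init_cutoff_factor or 3.0` fallback.
    # Note that `or` also replaces a configured 0.0 with 3.0.
    return init_cutoff_factor or 3.0


def truncated_normal(std, cutoff_factor, rng):
    # Rejection sampling: redraw until the value lies within
    # +/- cutoff_factor * std of the zero mean.
    bound = cutoff_factor * std
    while True:
        x = rng.gauss(0.0, std)
        if abs(x) <= bound:
            return x


def ff_out_std(std, n_layers):
    # The extra factor reported for OLMoBlock.ff_out under normal init.
    return std * (1 / math.sqrt(2 * n_layers))


def mitchell_embedding_std(d_model, scale_logits):
    # ASSUMPTION: mitchell uses a base std of 1 / sqrt(d_model).
    base_std = 1 / math.sqrt(d_model)
    emb_std_factor = (0.5 * math.sqrt(d_model)) if scale_logits else 1.0
    # The sqrt(d_model) terms cancel, leaving 0.5 for any d_model.
    return base_std * emb_std_factor


if __name__ == "__main__":
    rng = random.Random(0)
    cutoff = resolve_cutoff(None)
    samples = [truncated_normal(0.02, cutoff, rng) for _ in range(5)]
    print(samples)
    print(ff_out_std(0.02, n_layers=32))
    for d_model in (512, 1000, 4096):
        print(d_model, mitchell_embedding_std(d_model, scale_logits=True))
```

Under that assumption the embedding std no longer depends on d_model once scale_logits=True, which is what makes the constant 0.5 look like an accident of applying the factor in more than one place.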