allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

Rewrite initialization #607

Closed · AkshitaB closed this 4 weeks ago

AkshitaB commented 4 weeks ago

Simplifies our inscrutable initialization

Potential bugs found in initialization as a result of the refactoring (these will be fixed after feedback):

AkshitaB commented 4 weeks ago

> No major concerns. I'm glad we're cleaning this up.
>
> Why do we scale the embedding with the following factor if `scale_logits=True`? `emb_std_factor = (0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0`

This was another "trick" we heard works from someone else (not sure who).

Wouldn't this make more sense if we did this when `weight_tying` was on? I'm trying to get a sense of intuition for some of these choices/tricks.
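
For context, a minimal sketch of where a factor like this plugs in, assuming it multiplies the embedding's base init std and that `scale_logits` shrinks the output logits by `1/sqrt(d_model)` as its name suggests; the names `base_std`, `wte`, and `h` are illustrative, not the actual OLMo code:

```python
import math
import torch
import torch.nn as nn

# Illustrative values only; not OLMo's actual defaults.
d_model, vocab_size, base_std = 768, 50304, 0.02
scale_logits = True

# The factor in question: applied only when scale_logits is on.
emb_std_factor = (0.5 * math.sqrt(d_model)) if scale_logits else 1.0

# Assumption: the factor multiplies the embedding's base init std.
wte = nn.Embedding(vocab_size, d_model)
nn.init.normal_(wte.weight, mean=0.0, std=base_std * emb_std_factor)

# At the other end of the model, scale_logits shrinks the pre-softmax
# logits; with a tied output head, the boosted embedding init is what
# keeps the logit magnitudes from collapsing.
h = torch.randn(2, 8, d_model)   # stand-in for final hidden states
logits = h @ wte.weight.t()      # tied output projection
if scale_logits:
    logits = logits / math.sqrt(d_model)
```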

epwalsh commented 4 weeks ago


> Wouldn't this make more sense if we did this when `weight_tying` was on? I'm trying to get a sense of intuition for some of these choices/tricks.

Yea I'm guessing that's the only scenario where we tried it? It might have come from PaLM.
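
On the PaLM connection: PaLM ties the input and output embeddings, initializes them with a relatively large std (N(0, 1), since no layer norm is applied to the embeddings), and compensates by scaling the pre-softmax logits by 1/sqrt(n). A rough, self-contained check of that intuition, with illustrative sizes and a hypothetical helper, not OLMo code:

```python
import math
import torch

torch.manual_seed(0)
d_model, vocab_size = 768, 1024  # illustrative sizes

def tied_logit_std(emb_std: float, scale_logits: bool) -> float:
    """Std of the logits when one matrix serves as both embedding and output head."""
    emb = torch.randn(vocab_size, d_model) * emb_std
    h = torch.randn(32, d_model)              # stand-in for final hidden states
    logits = h @ emb.t()
    if scale_logits:
        logits = logits / math.sqrt(d_model)  # PaLM-style logit scaling
    return logits.std().item()

base_std = 0.02
print(tied_logit_std(base_std, scale_logits=False))  # baseline, ~0.02 * sqrt(768)
print(tied_logit_std(base_std, scale_logits=True))   # shrunk back down to ~0.02
boosted = base_std * 0.5 * math.sqrt(d_model)        # the factor from this thread
print(tied_logit_std(boosted, scale_logits=True))    # ~half the baseline again
```

Under these assumptions, scaling the embedding init only makes sense paired with both logit scaling and weight tying, which is consistent with the guess that the trick was only ever tried in that configuration.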