allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0
4.24k stars 400 forks

the loss spike #560

Open bpwl0121 opened 2 months ago

bpwl0121 commented 2 months ago

❓ The question

hi,

thanks for your awesome open-source work! I have a question about the loss spikes during training. Do you know why the spikes occur? And, from your wandb board, why do the spikes occur ONLY in the twin version? As far as I know, you just use different hardware?

twin version: [screenshot of loss curve]

normal version: [screenshot of loss curve]

dumitrac commented 2 months ago

@bpwl0121 - thank you for the question. The two models (OLMo-7B and OLMo-7B-Twin-2T) are identical except for differences in hardware and initialization. In an experiment, we showed that hardware isn't the cause of the loss spikes.

The cause is the difference in initialization. However, these are all "fast spikes" that recover quickly and cause no apparent harm.

Please let me know if this answers your question.

bpwl0121 commented 2 months ago

But I found that both used the "mitchell" init method; correct me if I am wrong. [screenshots of the two configs]

dumitrac commented 2 months ago

@bpwl0121 - it is correct that both models use the "mitchell" initialization method. The difference in initialization I was referring to is the difference in the values the model parameters received at initialization time: the "mitchell" method specifies a probability distribution for the parameters, but not the exact values.

You can verify that the initial parameters are different in the two models by comparing the checkpoints at step #0. Please let me know if this answers your question.
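To illustrate the point: an init method fixes the sampling distribution, but each run draws different concrete values unless the random seed and sampling order match exactly. A minimal PyTorch sketch (toy shape and std; this is not OLMo's actual init code or checkpoint format):

```python
import torch

def mitchell_like_init(d_model: int, seed: int) -> torch.Tensor:
    # Same distribution for both "models" (normal with std ~ 1/sqrt(d_model)),
    # but a different seed -> different concrete parameter values.
    gen = torch.Generator().manual_seed(seed)
    return torch.normal(0.0, d_model ** -0.5, (d_model, d_model), generator=gen)

w_a = mitchell_like_init(64, seed=0)  # one run's step-0 weights
w_b = mitchell_like_init(64, seed=1)  # the "twin" run's step-0 weights

print(torch.allclose(w_a, w_b))  # False: same init method, different parameters
```

Comparing the real step-0 checkpoints tensor-by-tensor in the same way would show the two models start from different weights despite sharing the config.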

bpwl0121 commented 2 months ago

@dumitrac thanks for your explanation, but in the configs (see screenshots) I found the same setting, and it is false for both the twin and non-twin versions. [screenshots]

So where can I find the right parameters for the init, as you mentioned, in the checkpoints at step #0?

Thanks

dirkgr commented 1 month ago

What do you mean by "find the right parameter for init"? What's the parameter you are missing?

bpwl0121 commented 1 month ago


I cannot remember well, but how do you set different values for the "mitchell" initialization method? I think I found the same init method in both training setups.

dirkgr commented 3 weeks ago

The mitchell init method takes no other parameters, apart from cutoff_factor, and you basically never have to touch that one.
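For context, a cutoff factor in this style of init just bounds the truncated normal the weights are sampled from. A rough sketch (not OLMo's exact code; shape and std are made up for illustration):

```python
import torch
from torch.nn.init import trunc_normal_

def init_with_cutoff(shape, std: float, cutoff_factor: float = 3.0) -> torch.Tensor:
    # Sample from N(0, std^2), truncated to [-cutoff_factor*std, +cutoff_factor*std]
    # so no weight starts more than cutoff_factor standard deviations from zero.
    w = torch.empty(shape)
    trunc_normal_(w, mean=0.0, std=std, a=-cutoff_factor * std, b=cutoff_factor * std)
    return w

w = init_with_cutoff((256, 256), std=256 ** -0.5)
print(bool(w.abs().max() <= 3.0 * 256 ** -0.5))  # True: all draws within the cutoff
```

The truncation only trims extreme tail draws, which is why this knob rarely needs tuning.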