Liuhong99 / Sophia

The official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”
MIT License

Having trouble replicating the result #10

Closed nalzok closed 1 year ago

nalzok commented 1 year ago

To replicate your experiments on GPT-2 125M (small), I executed the following commands. Note that I am running both experiments on the same node with eight A100 GPUs, and fixing the nproc_per_node, batch_size, and gradient_accumulation_steps for maximum comparability.

$ torchrun --standalone --nproc_per_node=8 train_sophiag.py config/train_gpt2_small_sophiag.py --batch_size=12 --gradient_accumulation_steps=5      # Sophia-G
$ torchrun --standalone --nproc_per_node=8 train_adam.py config/train_gpt2_small_adam.py --batch_size=12 --gradient_accumulation_steps=5            # Adam

Here is the WandB report: https://api.wandb.ai/links/nalzok/o4r3czax

While there are some differences in lr, param_norm, momentum_norm, and train/clip_rate, I cannot see much improvement in train/loss and val/loss. Can you check if the code on GitHub is the latest version?

nalzok commented 1 year ago

I was informed by someone (who presumably wants to stay anonymous, since he contacted me through email instead of commenting here) that the hyperparameter rho should be 20 for Sophia-G, as reported in the paper just below Table 2, but here we have

https://github.com/Liuhong99/Sophia/blob/2c83ea341fa0c55d82796b275868257f7fc60eaa/config/train_gpt2_medium_sophiag.py#L38

However, after some digging, I found a commit that did some reparameterization with respect to rho: https://github.com/Liuhong99/Sophia/commit/32b7fb64c8a98ce4b796afdf7a299ffa56e95e3f. It effectively decreases rho from 480 * 30 to 0.03. Is that intentional?
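For reference, here is how I understand rho to enter the update. This is my own rough paraphrase in PyTorch, not the repository's exact code, and it ignores constant factors (such as batch size) that the reparameterization may fold into rho: the preconditioned momentum is clipped element-wise at 1, so multiplying rho by a constant while dividing the same constant out of the Hessian term leaves the step unchanged.

import torch

# Rough paraphrase of the clipped Sophia-G step (not the repo's exact code).
def sophiag_step(param, exp_avg, hess, lr, rho, eps=1e-15):
    # element-wise clip(|m| / (rho * h), 1), then a signed step scaled by lr
    ratio = (exp_avg.abs() / (rho * hess + eps)).clamp(max=1.0)
    param.data.add_(exp_avg.sign() * ratio, alpha=-lr)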

Liuhong99 commented 1 year ago

Sorry to hear this. Could you please try the following hyperparameters?

learning_rate = 5e-4 # max learning rate
weight_decay = 2e-1
beta1 = 0.965
beta2 = 0.99
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
rho = 0.04
interval = 10
warmup_iters = 2000 # how many steps to warm up for
min_lr = 1e-5 
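Roughly, these map onto the optimizer construction as below. This is only a sketch for orientation; see sophia.py and train_sophiag.py for the exact signatures, and note that interval is used by the training loop as the Hessian-update frequency rather than by the optimizer itself.

import torch
from sophia import SophiaG  # sophia.py from this repo

model = torch.nn.Linear(16, 16)  # stand-in for the GPT-2 small model
optimizer = SophiaG(
    model.parameters(),
    lr=5e-4,
    betas=(0.965, 0.99),
    rho=0.04,
    weight_decay=2e-1,
)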

This should be better than the run in the wandb report you shared. Could you please let me know what hyperparameters you are using for AdamW? They should be:

optimizer_name = 'adamw'
learning_rate = 6e-4 # max learning rate
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
warmup_iters = 2000 # how many steps to warm up for
min_lr = 3e-5 
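For reference, this corresponds to a standard PyTorch AdamW constructed roughly like the following; the training script builds the optimizer from these config values, so this is only to show where the numbers land.

import torch

model = torch.nn.Linear(16, 16)  # stand-in for the GPT-2 small model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=1e-1,
)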

Actually in the report you shared, Sophia-G is better than AdamW.

(screenshot: zoomed-in train/val loss curves, Sophia-G vs. AdamW)

I think your Sophia-G loss is 0.02 worse than my run, and your AdamW is 0.01 better than mine. Let me know if you have further questions!

nalzok commented 1 year ago

Thanks for the reply!

Could you please try the following hyperparameters

Sure, I have started a run with the Sophia-G hyperparameters you provided. You can even monitor the progress in real-time via the WandB report linked above: https://api.wandb.ai/links/nalzok/o4r3czax (spoiler: it looks promising!)

Could you please let me know what hyperparameters you are using for AdamW?

I am using the hyperparameters you provided in train_gpt2_small_adam.py. They are the same as the ones you listed.

https://github.com/Liuhong99/Sophia/blob/bff9df9b584e2084fe037af1ab38f4db31f0acca/config/train_gpt2_small_adam.py#L26-L36

Actually in the report you shared, Sophia-G is better than AdamW.

Yes, you are right. I should have zoomed in because the high loss values at the beginning dwarf everything.

I think the loss of Sophia-G is 0.02 worse than my run and AdamW is 0.01 better than my run.

It feels like you are trying quite hard to tune the hyperparameters for Sophia-G, whereas for AdamW you simply cite the "well-established" default. How would you defend yourselves against the suspicion of p-hacking and justify that the 2x speed-up reported in your paper is not an artifact of overfitting to the validation set?

Additionally, I think the experiment can be improved by including some error bars. I understand that a decrease on the order of 0.01 in the validation loss is already a big deal, but the loss curve looks quite noisy.
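Concretely, I am thinking of something as simple as repeating each configuration with a few seeds and reporting the mean and standard deviation of the final validation loss. A toy sketch with made-up numbers:

import numpy as np

# hypothetical final validation losses, one per random seed (not real results)
sophia_runs = np.array([2.893, 2.901, 2.897])
adamw_runs = np.array([2.915, 2.908, 2.921])
for name, runs in [("sophia-g", sophia_runs), ("adamw", adamw_runs)]:
    print(f"{name}: {runs.mean():.3f} +/- {runs.std(ddof=1):.3f}")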

Liuhong99 commented 1 year ago

including some error bars

Thanks for the suggestion! We will include results with error bars for the 125M runs in the next version. However, error bars for larger models are not feasible because we can only afford one or two runs. In another code base we are using a larger validation set to make the validation loss curve less noisy. That code base will be released soon.

It feels like you are trying quite hard to tune the hyperparameters for Sophia-G, whereas for AdamW you simply cite the "well-established" default. How would you defend yourselves against the suspicion of p-hacking and justify that the 2x speed-up reported in your paper is not an artifact of overfitting to the validation set

We did not tune the hyperparameters for Sophia-G much. The hyperparameters I sent you are only from the third run I ever tried and could be far from optimal (I'm still using a conservative LR and WD). All hyperparameters transfer to larger models except the learning rate, which is also the standard practice. The well-established AdamW hyperparameters are already very well tuned and aggressive (even on the edge of instability): the 6e-4 learning rate fails with some random seeds, and we selected a random seed with which it works. For larger models we have to include other tricks to make Adam work.

nalzok commented 1 year ago

I have tried the hyperparameters you provided, and the result turns out to be worse than the last Sophia-G run. Here is the report again for your convenience: https://api.wandb.ai/links/nalzok/o4r3czax

(screenshot: loss curves comparing the new Sophia-G run with the previous one)

Moreover, the clip rate surges after 50k steps. The same goes for the win rate and Hessian norm.

(screenshot: clip rate, win rate, and Hessian norm curves)

Can you share your WandB report with the above-mentioned hyperparameters for Sophia-G?

Liuhong99 commented 1 year ago

The report can be seen at https://api.wandb.ai/links/hliu99/rs9tp0rb. The green curve is the run with the new hyperparameters I sent you. Clip rate, win rate, and Hessian norm all remain stable throughout training. I think the only difference is the machines: I'm using A5000s, but I guess that should not matter.