I was informed by someone (who presumably wants to stay anonymous, since he contacted me through email instead of leaving a comment here) that the hyperparameter rho should be 20 for Sophia-G, as reported in the paper just below Table 2, but the value used in the code here is much smaller.
However, after some digging, I found a commit that did some reparameterization with respect to rho: https://github.com/Liuhong99/Sophia/commit/32b7fb64c8a98ce4b796afdf7a299ffa56e95e3f. It effectively decreases rho from 480 * 30 to 0.03. Is that intentional?
Sorry to hear this. Could you please try the following hyperparameters:
learning_rate = 5e-4 # max learning rate
weight_decay = 2e-1
beta1 = 0.965
beta2 = 0.99
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
rho = 0.04
interval = 10
warmup_iters = 2000 # how many steps to warm up for
min_lr = 1e-5
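For concreteness, here is a minimal sketch of how these values map onto the optimizer object, assuming the SophiaG constructor in this repo's sophia.py takes lr, betas, rho, and weight_decay; grad_clip, interval, and the learning-rate decay settings belong to the training loop rather than the constructor:

```python
# Minimal sketch, assuming `SophiaG(params, lr, betas, rho, weight_decay)`
# from this repo's sophia.py.
import torch
from sophia import SophiaG

model = torch.nn.Linear(10, 10)  # stand-in for the GPT-2 small model
optimizer = SophiaG(
    model.parameters(),
    lr=5e-4,              # learning_rate
    betas=(0.965, 0.99),  # (beta1, beta2)
    rho=0.04,
    weight_decay=2e-1,
)
# grad_clip, interval (how often the Hessian estimate is refreshed),
# warmup_iters, min_lr, and decay_lr are handled in the training loop.
```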
These hyperparameters should be better than the run in the WandB report I shared. Could you please let me know what hyperparameters you are using for AdamW? It should be:
optimizer_name = 'adamw'
learning_rate = 6e-4 # max learning rate
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
warmup_iters = 2000 # how many steps to warm up for
min_lr = 3e-5
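For comparison, the same kind of sketch for AdamW with these values, using PyTorch's built-in torch.optim.AdamW (the decay and warmup settings are again training-loop concerns):

```python
# Minimal sketch using PyTorch's torch.optim.AdamW.
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the GPT-2 small model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,            # learning_rate
    betas=(0.9, 0.95),  # (beta1, beta2)
    weight_decay=1e-1,
)
# grad_clip = 1.0 corresponds to calling
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) before each step;
# decay_lr, warmup_iters, and min_lr are implemented by the LR schedule.
```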
Actually in the report you shared, Sophia-G is better than AdamW.
I think the loss of Sophia-G is 0.02 worse than in my run and AdamW is 0.01 better than in my run. Let me know if you have further questions!
Thanks for the reply!
> Could you please try the following hyperparameters
Sure, I have started a run with the Sophia-G hyperparameters you provided. You can even monitor the progress in real-time via the WandB report linked above: https://api.wandb.ai/links/nalzok/o4r3czax (spoiler: it looks promising!)
> Could you please let me know what hyperparameters you are using for AdamW?
I am using the hyperparameters you provided in train_gpt2_small_adam.py. They are the same as the ones you listed.
> Actually in the report you shared, Sophia-G is better than AdamW.
Yes, you are right. I should have zoomed in because the high loss values at the beginning dwarf everything.
> I think the loss of Sophia-G is 0.02 worse than in my run and AdamW is 0.01 better than in my run.
It feels like you are trying quite hard to tune the hyperparameters for Sophia-G, whereas for AdamW you simply cite the "well-established" default. How would you defend yourselves against the suspicion of p-hacking and justify that the 2x speed-up reported in your paper is not an artifact of overfitting to the validation set?
Additionally, I think the experiment can be improved by including some error bars. I understand that a decrease on the order of 0.01 in the validation loss is already a big deal, but the loss curve looks quite noisy.
> including some error bars
Thanks for the suggestion! We will include results with error bars for the 125M runs in the next version. However, error bars for larger models are not even feasible because we can only afford one or two runs. We are using a larger validation set to make the validation loss curve less noisy in another codebase. That codebase will be released soon.
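For the 125M runs, something like the following would be enough to produce the error bars; this is a minimal sketch, and the file names, number of seeds, and evaluation interval are hypothetical stand-ins for however the per-seed validation losses are exported (e.g. from WandB):

```python
# Minimal sketch: mean +/- one standard deviation of validation loss across seeds.
# `val_loss_seed{s}.npy` and the eval interval of 1000 steps are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

losses = np.stack([np.load(f"val_loss_seed{s}.npy") for s in range(3)])  # (seeds, evals)
steps = np.arange(losses.shape[1]) * 1000

mean, std = losses.mean(axis=0), losses.std(axis=0)
plt.plot(steps, mean, label="mean over seeds")
plt.fill_between(steps, mean - std, mean + std, alpha=0.3, label="±1 std")
plt.xlabel("step")
plt.ylabel("validation loss")
plt.legend()
plt.show()
```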
> It feels like you are trying quite hard to tune the hyperparameters for Sophia-G, whereas for AdamW you simply cite the "well-established" default. How would you defend yourselves against the suspicion of p-hacking and justify that the 2x speed-up reported in your paper is not an artifact of overfitting to the validation set?
We did not tune the hyperparameters for Sophia-G much. Actually, the hyperparameters I sent you are only the third run I ever tried and could be far from optimal (I'm still using an overly conservative LR and WD). All hyperparameters transfer to larger models except the learning rate, which is also the standard practice. The well-established AdamW hyperparameters are already very well tuned and aggressive (even on the edge of instability): the 6e-4 learning rate fails with some random seeds, and we selected a random seed with which it works. For larger models we have to include other tricks to make Adam work.
I have tried the hyperparameters you provided, and the result turns out to be worse than the last Sophia-G run. Here is the report again for your convenience: https://api.wandb.ai/links/nalzok/o4r3czax
Moreover, the clip rate surges after 50k steps. The same goes for the win rate and Hessian norm.
Can you share your WandB report with the above-mentioned hyperparameters for Sophia-G?
The report can be seen at https://api.wandb.ai/links/hliu99/rs9tp0rb. The green curve is the run with the new hyperparameters I sent you. The clip rate, win rate, and Hessian norm all remain stable throughout training. I think the only difference is the machines: I'm using A5000s, but I guess that should not matter.
To replicate your experiments on GPT-2 125M (small), I executed the following commands. Note that I am running both experiments on the same node with eight A100 GPUs, and fixing nproc_per_node, batch_size, and gradient_accumulation_steps for maximum comparability. Here is the WandB report: https://api.wandb.ai/links/nalzok/o4r3czax
While there are some differences in lr, param_norm, momentum_norm, and train/clip_rate, I cannot see much improvement in train/loss and val/loss. Can you check if the code on GitHub is the latest version?