Liuhong99 / Sophia

The official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”
MIT License

Trying to reproduce: AdamW better than SophiaG after tuning #31

Closed adefazio closed 1 year ago

adefazio commented 1 year ago

I'm running the small model on 16 V100 GPUs, float32, 'batch_size': 6, 'gradient_accumulation_steps': 5. I ran a sweep over LR and decay values and found that the best results were with LR 0.001 for both methods and weight decay 0.2.

I'm getting (valid loss) 2.876 for SophiaG (compared to your value of 2.894) and 2.869 for AdamW (you report 2.927). If I use LR 0.0003 for SophiaG, it does give lower valid loss for most of the run, but the other runs eventually overtake it; see the plot below. There is a lot of noise in these results, so I'm inclined to believe that SophiaG behaves similarly to AdamW in practice in terms of final validation loss. Smaller LR values often give faster convergence initially but worse final results; this is a common pattern, so I'm hesitant to read anything into that either.

I'm not using your code entirely unchanged: I took the train_sophiag script and used it for AdamW as well, since I was concerned that the many small differences between the train_sophiag and train_adam scripts could be favoring Sophia. I am likewise using the Sophia config for both methods, in particular min_lr = 1.5e-5 (which is smaller than the 3e-5 min lr in the Adam config) and the beta values.
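For reference, the shared settings boil down to something like this (a sketch using nanoGPT-style config keys, which this repo's train scripts follow; exact key names are approximate, not copied verbatim from the repo):

```python
# Sketch of the shared configuration described above (nanoGPT-style keys; approximate).
shared_overrides = dict(
    batch_size=6,                    # per-GPU micro-batch on 16 V100s
    gradient_accumulation_steps=5,
    dtype='float32',
    learning_rate=1e-3,              # best peak LR found in the sweep, used for both optimizers
    min_lr=1.5e-5,                   # taken from the Sophia config for both runs
    weight_decay=0.2,                # best decay found in the sweep
)
```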

Could you suggest any changes, ablations, or modifications I should try that might help here? I can run any additional hyper-parameter combinations you suggest.

[Plot: validation loss for the SophiaG and AdamW learning-rate sweep]
Liuhong99 commented 1 year ago

Thanks for sharing these findings! I think a big difference could be the precision. I have never used float32 before. I believe it is known that bfloat16 leads to worse loss for language models, and your findings seem to corroborate this. For SophiaG to perform better, you could try a 6e-4 peak lr and a 1e-5 final lr, and set beta to 0.05. This configuration gives me 2.873 validation loss in bfloat16 (sorry, I still haven't updated the repo with it). I'm using 10 A5000s (bs=8, gradient_accumulation_steps=6). Currently all my servers are down; I'll run Sophia-G again in float32 when I get them back next week.
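In code, this maps roughly onto the construction below (a sketch against the SophiaG constructor in sophia.py; the final lr of 1e-5 is enforced by the LR schedule rather than the optimizer, and the placement of the 0.05 is my reading, so double-check it once the updated config lands in the repo):

```python
from sophia import SophiaG

# Sketch only: peak lr from the suggestion above; the 1e-5 final lr belongs to the
# cosine schedule (min_lr), not the optimizer itself.
optimizer = SophiaG(
    model.parameters(),
    lr=6e-4,               # suggested peak learning rate
    betas=(0.965, 0.99),   # repo defaults
    rho=0.05,              # assumes the suggested 0.05 is Sophia's rho; adjust if a beta was meant
    weight_decay=0.2,
)
```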

adefazio commented 1 year ago

I can run again with the parameters you suggested. I think the final_lr value is the most likely to be affecting the results, given that the other methods only start to pull ahead at the end.

Liuhong99 commented 1 year ago

Agreed that the final_lr value is likely to affect the comparison. Although the Chinchilla paper suggested using 0.1x the peak lr as the final_lr, I am sure 0.1x is not optimal for SophiaG.
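For reference, the final_lr enters through the warmup-plus-cosine schedule (nanoGPT-style, which I assume these train scripts follow); a sketch, with illustrative constants:

```python
import math

def get_lr(it, peak_lr=6e-4, min_lr=1e-5, warmup_iters=2000, lr_decay_iters=100000):
    """Linear warmup, then cosine decay from peak_lr down to min_lr (sketch)."""
    if it < warmup_iters:                        # linear warmup
        return peak_lr * it / warmup_iters
    if it > lr_decay_iters:                      # hold at the floor after decay ends
        return min_lr
    progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 -> 0
    return min_lr + coeff * (peak_lr - min_lr)

# e.g. setting min_lr = 0.1 * peak_lr gives the Chinchilla-style floor discussed above
```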

Liuhong99 commented 1 year ago

I'm also curious: did you use float32 because the V100 does not support bfloat16? If you are using 16GB V100s, batch_size = 6 is likely to cause an out-of-memory error.

adefazio commented 1 year ago

Yes, I'm using 32GB V100s, which don't support bfloat16. I usually use float32 as it's less error prone than float16 + scaling.
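As an aside, the three precision setups being compared differ only in how the forward/backward is wrapped; a minimal PyTorch AMP sketch (nothing here is specific to this repo, and model/X/Y/optimizer are placeholders):

```python
from contextlib import nullcontext
import torch

dtype = 'float32'  # 'bfloat16' (Ampere, e.g. A5000/A100), 'float16' (V100 + scaler), or 'float32'
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if dtype == 'float32' else torch.cuda.amp.autocast(dtype=ptdtype)
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))  # no-op except for float16

with ctx:
    logits, loss = model(X, Y)
scaler.scale(loss).backward()   # plain backward when the scaler is disabled
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
```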


Liuhong99 commented 1 year ago

Hi Aaron, I was able to get the servers back and complete the fp32 run. That configuration eventually reached 2.846 validation loss in fp32. I used 10 A5000s (bs=4, gradient_accumulation_steps=12).
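For what it's worth, the effective batch size (sequences per optimizer step) works out the same across the setups mentioned in this thread:

```python
# sequences per optimizer step = num_gpus * per_gpu_batch_size * grad_accum_steps
print(16 * 6 * 5)    # 480  (16 V100s, bs=6, accum=5)
print(10 * 8 * 6)    # 480  (10 A5000s, bs=8, accum=6, bfloat16 run)
print(10 * 4 * 12)   # 480  (10 A5000s, bs=4, accum=12, float32 run)
```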

[Plot: validation loss for the float32 SophiaG run]

The gap between bfloat16 and float32 can be larger than 0.02 (the other run shown is in bfloat16).

[Plot: validation loss curves, bfloat16 vs float32 runs]
adefazio commented 1 year ago

Ok, I'll let you know how my debugging goes.

adefazio commented 1 year ago

My rerun didn't work as well as your float32 run. I'm not sure why, but I don't have time to investigate further, so I'm going to close this issue.

ryanbty commented 1 year ago

Hi @adefazio,

I also want to reproduce the paper's results for GPT-2 small using V100 GPUs and float32. I tried the hyperparameters proposed in train_gpt2_small_sophiag.py and train_gpt2_small_adam.py, but I don't get the same results as the paper: AdamW and Sophia seem to behave the same. Which hyperparameters did you use?

In addition, training GPT-2 with Sophia in float16 + loss scaling makes the loss diverge. Have you tried it @Liuhong99?
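For context on the float16 point, here is a rough sketch of how loss scaling might wrap a Sophia-style loop. It assumes the README-style usage of SophiaG (optimizer.step(bs=...) and optimizer.update_hessian()) plus the standard torch.cuda.amp.GradScaler; model, data_loader, bs, and k are placeholders. One detail worth checking is that the extra backward pass for the Hessian estimate also produces scaled gradients, so they are unscaled here before update_hessian(). This is only an illustration, not a diagnosis of the divergence.

```python
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()  # loss scaling for float16

for iter_num, (X, Y) in enumerate(data_loader):
    # regular step
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits, loss = model(X, Y)
    scaler.scale(loss).backward()
    scaler.step(optimizer, bs=bs)     # GradScaler.step forwards extra kwargs to optimizer.step
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

    # periodic Hessian-EMA update (README-style Gauss-Newton-Bartlett estimate)
    if iter_num % k == k - 1:
        with torch.cuda.amp.autocast(dtype=torch.float16):
            logits, _ = model(X, None)
        y_sample = torch.distributions.Categorical(logits=logits).sample()
        loss_sampled = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                       y_sample.view(-1), ignore_index=-1)
        scaler.scale(loss_sampled).backward()
        inv_scale = 1.0 / scaler.get_scale()
        for p in model.parameters():          # manually unscale so the Hessian EMA
            if p.grad is not None:            # is not built from scaled gradients
                p.grad.mul_(inv_scale)
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```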

adefazio commented 1 year ago

I was not able to replicate in the end.

ryanbty commented 1 year ago

Thanks @adefazio.

Please @Liuhong99, can you share the hyperparameters needed to reproduce the GPT-2 small results in Figure 4 of the paper, for both AdamW and SophiaG?