Thanks for sharing these findings! I think a big difference could be the precision; I've never used float32 before. I think it is known that bfloat16 will lead to worse loss for language models, and your findings somewhat corroborate this. For SophiaG to perform better, it's possible to use a 6e-4 peak lr and 1e-5 final lr and set beta to 0.05. This configuration gives me 2.873 validation loss in bfloat16. (Sorry, I still haven't pushed this to the repo.) I'm using 10 A5000s (bs=8, gradient_accumulation_steps=6). Currently all my servers are down; I'll run SophiaG again in float32 when I get them back next week.
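For concreteness, here is roughly how those suggestions might look in a nanoGPT-style config file. The variable names are my assumption based on the repo's config scripts, and I'm not sure which SophiaG knob the "beta to 0.05" suggestion maps to, so treat this as a sketch rather than the exact file:

```python
# Sketch of a nanoGPT-style config for the SophiaG run described above.
# Variable names are assumed; check train_gpt2_small_sophiag.py for the real ones.
learning_rate = 6e-4                 # suggested peak lr
min_lr = 1e-5                        # suggested final lr
batch_size = 8                       # per-GPU batch size (10 A5000s)
gradient_accumulation_steps = 6
# "set beta to 0.05": not hard-coded here because it is unclear which
# optimizer argument this corresponds to; set it in the training script.
```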
I can run again with the parameters you suggested. I think the final_lr value is the most likely to be affecting the results, given that the other methods only start to pull ahead at the end.
Agreed that the final_lr value is likely to affect the comparison. Although the Chinchilla paper suggested using 0.1x the peak lr as the final_lr, I'm sure 0.1x is not optimal for SophiaG.
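As a quick back-of-the-envelope comparison using the 6e-4 peak lr suggested above (just the ratios, not a recommendation):

```python
peak_lr = 6e-4
chinchilla_final_lr = 0.1 * peak_lr   # 6e-5 under the 0.1x rule of thumb
suggested_final_lr = 1e-5             # the value suggested above, roughly 0.017x the peak
```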
I'm also curious: did you use float32 because the V100 doesn't support bfloat16? If you are using 16GB V100s, batch_size = 6 is likely to cause an out-of-memory error.
Yes, I'm using 32GB V100s, which don't support bfloat16. I usually use float32 as it's less error-prone than float16 + loss scaling.
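For context, "float16 + scaling" refers to the usual torch.cuda.amp pattern; a minimal sketch (the model, optimizer, and data below are stand-ins, not the actual training script):

```python
import torch

# Stand-ins so the sketch runs; in practice these are the GPT-2 model and data loader.
model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = [(torch.randn(4, 16, device="cuda"), torch.randn(4, 16, device="cuda"))] * 3

scaler = torch.cuda.amp.GradScaler()  # only needed for float16, not float32/bfloat16

for X, Y in loader:
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(X), Y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                         # unscale before gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)                             # step is skipped on inf/nan gradients
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```

Keeping everything in float32 skips the scaler and its overflow/underflow handling entirely, which is what makes it less error-prone.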
Hi Aaron, I was able to get the server back and complete the fp32 run. That configuration eventually led to 2.846 validation loss with fp32. I used 10 A5000s (bs=4, gradient_accumulation_steps=12).
The gap between bfloat16 and float32 can be larger than 0.02 (the other run was in bfloat16).
Ok, I'll let you know how my debugging goes.
My rerun didn't work as well as your float32 run. I'm not sure why, but I don't have further time to investigate, so I'm going to close this issue.
Hi @adefazio,
I also want to reproduce the paper's GPT-2 small results using V100 GPUs and float32. I tried the hyperparameters proposed in train_gpt2_small_sophiag.py and train_gpt2_small_adam.py, but I don't get the same results as the paper: AdamW and Sophia show essentially the same behavior. Which hyperparameters did you use?
In addition, training GPT-2 with Sophia using float16 + loss scaling makes the loss diverge. Have you tried it, @Liuhong99?
I was not able to replicate in the end.
Thanks @adefazio.
Please, @Liuhong99, can you share the hyperparameters needed to reproduce the GPT-2 small results in Figure 4 of the paper, for both AdamW and SophiaG?
I'm running the small model on 16 V100 GPUs with float32, 'batch_size': 6, 'gradient_accumulation_steps': 5. I ran a sweep over LR and decay values and found that the best results were with LR 0.001 for both methods, and decay 0.2.
I'm getting (validation loss) SophiaG 2.876 (compared to your value of 2.894) and AdamW 2.869 (you get 2.927). If I use LR 0.0003 for SophiaG, it does give lower validation loss for most of the run, but eventually the other runs overtake it; see the plot below. There is a lot of noise in these results, so I'm inclined to believe that SophiaG behaves similarly to AdamW in practice in terms of final validation loss. Using smaller LR values often gives faster initial convergence but worse final results; this is a common pattern, so I'm hesitant to read anything into that either.
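For reference, the min_lr/final_lr only bites at the very end of training because of the cosine shape of the schedule; a sketch of the nanoGPT-style schedule (the iteration counts here are placeholders, not the values from my sweep):

```python
import math

learning_rate  = 1e-3     # peak lr from the sweep
min_lr         = 1.5e-5   # final lr taken from the sophia config
warmup_iters   = 2000     # placeholder
lr_decay_iters = 100_000  # placeholder decay horizon

def get_lr(it):
    if it < warmup_iters:                 # linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:               # past the decay horizon: flat at min_lr
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # decays from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)
```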
I'm not using your code completely unchanged: I took the train_sophiag script and used it for Adam as well, since I was concerned that the many small differences between the train_sophiag and train_adam scripts could be favoring Sophia. I am likewise using the configuration from the sophia config for both methods, in particular min_lr = 1.5e-5 (which is smaller than the 3e-5 min_lr in the adam file) and the beta values.
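Concretely, the setup is roughly the following sketch; the SophiaG import and its rho argument reflect my reading of sophia.py and may not match the repo exactly, and the model is a stand-in:

```python
import torch
from sophia import SophiaG  # optimizer class from the Sophia repo

model = torch.nn.Linear(16, 16)     # stand-in for the GPT-2 small model
lr, min_lr, weight_decay = 1e-3, 1.5e-5, 0.2   # min_lr is used by the lr schedule, not here
betas = (0.965, 0.99)               # beta values as in the sophia config (assumed)

# Both optimizers get identical lr, betas, and weight decay, so the only
# difference between the two runs is the update rule itself.
adamw_opt  = torch.optim.AdamW(model.parameters(), lr=lr, betas=betas,
                               weight_decay=weight_decay)
sophia_opt = SophiaG(model.parameters(), lr=lr, betas=betas,
                     rho=0.04, weight_decay=weight_decay)
```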
Could you suggest any changes, ablations, or other modifications that might help here? I can run any additional hyperparameter combinations you suggest.