JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

GLUE evaluation numbers are very poor if I increase the sequence length to 512 and use float32 #28

Closed tbaggu closed 1 year ago

tbaggu commented 1 year ago

Hi

I am trying to do some benchmarking as part of my experiments. I want to train a BERT model with a sequence length of 512 and float32 dtype. I have pre-trained the model with the above configuration and run the evaluation on glue_sne, but the numbers are very poor.

May I know what went wrong?

JonasGeiping commented 1 year ago

Hm, I haven't trained with that combination before, so it's hard to say what is going on. One sanity-check question: during downstream finetuning, is the model also set to float32?
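A quick way to check is to look at the parameter dtypes of the model that goes into the GLUE finetuning loop. A minimal plain-PyTorch sketch (not code from this repo; the `torch.nn.Linear` is only a stand-in for the crammed model and its classification head):

```python
import torch

def parameter_dtypes(model: torch.nn.Module) -> set:
    """Collect all parameter dtypes; a pure-fp32 model should yield {torch.float32}."""
    return {p.dtype for p in model.parameters()}

model = torch.nn.Linear(768, 2)        # stand-in for the crammed BERT + GLUE head
print(parameter_dtypes(model))         # {torch.float32} with default settings
print(parameter_dtypes(model.half()))  # {torch.float16} would indicate a mismatch
```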

tbaggu commented 1 year ago

Yes, it is.


JonasGeiping commented 1 year ago

Hm, just a note: there is also a separate max_seq_length setting in the eval config, which defaults to only 128. Evaluating at this lower sequence length shouldn't make things much worse, though.
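(For context, that setting just controls how the GLUE inputs are truncated/padded at tokenization time, independent of the pretraining sequence length. Roughly like this, using a Hugging Face tokenizer as a stand-in rather than the exact code path in this repo:)

```python
from transformers import AutoTokenizer

# Illustrative only: eval-side max_seq_length amounts to truncating/padding
# each tokenized GLUE sentence pair to a fixed length.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in tokenizer
batch = tokenizer(
    "a premise sentence",
    "a hypothesis sentence",
    truncation=True,
    max_length=128,          # the eval default mentioned above
    padding="max_length",
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([1, 128])
```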

tbaggu commented 1 year ago

I have pre-trained and fine-tuned with different settings and observed steps vs. loss for each of them. In the case of a 512 sequence length with fp32, I can see spikes in the loss. I think it's due to the learning rate, so I have reduced the lr and am re-training; let me see how the results turn out.
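Roughly what I am trying looks like the following (a plain-PyTorch sketch with illustrative numbers, not the actual training code): a lower peak learning rate with linear warmup, plus gradient clipping to damp the spikes.

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the BERT model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # reduced peak LR
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 10_000)  # linear warmup over 10k steps
)

def training_step(batch: torch.Tensor) -> float:
    loss = model(batch).pow(2).mean()  # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip spikes
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()

print(training_step(torch.randn(8, 768)))
```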


JonasGeiping commented 1 year ago

Closing this for now, cannot reproduce. Let me know if you find the source for this potential discrepancy.