UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

fp16 training errors for mt5 #2703

Open saurabhkumar opened 3 months ago

saurabhkumar commented 3 months ago

Training with google/mt5-base as the base model, fp16, and the triplet loss on the all-nli data (following the trainer example) fails: the loss is zero and grad_norm is nan, e.g. `{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.0}`. There were similar problems with these models long ago, as mentioned here. I want to try this on a GPU without bf16 support. @tomaarsen: it works for google-t5/t5-base, so the fix has evidently been applied to that model. Would it be possible to ask someone at Hugging Face to apply the same fix to mt5 (or possibly get information from someone on the aya-101 team about whether they tried it)?
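For reference, a minimal sketch of the failing setup, assuming the v3 `SentenceTransformerTrainer` API from the all-nli example; the dataset slice, `output_dir`, and hyperparameters below are illustrative placeholders, not the exact values used:

```python
# Sketch of a repro setup (assumed, based on the all-nli trainer example).
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# Wrapping a plain Hugging Face checkpoint adds a pooling layer automatically.
model = SentenceTransformer("google/mt5-base")

# (anchor, positive, negative) triplets, as in the documentation example.
train_dataset = load_dataset(
    "sentence-transformers/all-nli", "triplet", split="train[:10000]"
)

loss = losses.TripletLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="mt5-base-all-nli",  # placeholder
    fp16=True,  # mixed precision that triggers loss=0.0 / grad_norm=nan here
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```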

tomaarsen commented 3 months ago

Hello!

I'm glad to hear that it does work for google-t5/t5-base, so I agree with you that we're probably dealing with the issue from https://discuss.huggingface.co/t/t5-fp16-issue-is-fixed/3139.

I found out that the original fix was this PR: https://github.com/huggingface/transformers/pull/9487
These changes have also been propagated to mt5: https://github.com/huggingface/transformers/blob/main/src/transformers/models/mt5/modeling_mt5.py#L576

But there are indeed some folks who still report issues. I think it's an issue that would need to be fixed in transformers. Alternatively, you can use bf16=True if your GPU supports it, or train with full precision instead.
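A small sketch of that workaround, choosing bf16 when the GPU supports it and otherwise falling back to full fp32; the argument names come from the standard training arguments, and `output_dir` is again a placeholder:

```python
# Prefer bf16 when supported, otherwise train in full precision,
# avoiding the fp16 overflow that produces loss=0.0 / grad_norm=nan.
import torch
from sentence_transformers import SentenceTransformerTrainingArguments

use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = SentenceTransformerTrainingArguments(
    output_dir="mt5-base-all-nli",  # placeholder
    bf16=use_bf16,  # numerically safer than fp16 for (m)T5
    fp16=False,     # full fp32 if bf16 is unavailable
)
```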