Closed tempbrucefu closed 3 years ago
This is a common issue with larger models: for some runs, the model will diverge.
Simple solution: just restart training until you get a run that converges.
More involved solution: have a look at this paper: https://arxiv.org/abs/2004.08249
It discusses this instability in larger transformer models and proposes methods to reduce the probability that it happens.
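The "restart until it converges" approach can be sketched as a simple retry loop. Here `train_and_eval` is a hypothetical stand-in for one full training run (e.g. running `training_stsbenchmark.py` once) that returns the dev-set cosine-similarity score, or NaN on divergence:

```python
import math
import random


def train_and_eval(seed):
    # Hypothetical stand-in for a full training run; a real version
    # would set all RNG seeds, train the model, and return the dev
    # cosine-similarity score (NaN when the run diverged).
    random.seed(seed)
    return float("nan") if random.random() < 0.5 else random.uniform(0.7, 0.9)


def train_until_converged(max_retries=5):
    # Re-run training with a fresh seed until the dev score is a real number.
    for attempt in range(max_retries):
        score = train_and_eval(seed=attempt)
        if not math.isnan(score):
            return score, attempt
    raise RuntimeError("all runs diverged; consider the fixes in arXiv:2004.08249")


score, attempt = train_until_converged()
print(f"converged on attempt {attempt} with dev score {score:.3f}")
```

Varying the seed between attempts matters: with identical seeds every retry would reproduce the same diverging run.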
I tried to use albert-xlarge-v2 with train_batch_size 16 on STSbenchmark (training_stsbenchmark.py). During the dev evaluation, the cosine similarity is NaN, and the model outputs look close to constant. Please note that training works fine with stsb-distilbert-base, although its test results are always a bit lower than the published ones.