Open Abcdeabcd opened 2 months ago
Ensuring stable model training is quite critical. I suggest you check the loss curve.
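For example, a minimal sketch for plotting the per-iteration loss from a plain-text training log (the log format here is only an assumption; adapt the regex to whatever your training script actually prints):

```python
import re
import matplotlib.pyplot as plt

# Assumed log format: lines containing "loss: <float>".
LOSS_PATTERN = re.compile(r"loss:\s*([0-9]*\.?[0-9]+)")

def plot_loss(log_path: str, out_path: str = "loss_curve.png") -> None:
    losses = []
    with open(log_path) as f:
        for line in f:
            m = LOSS_PATTERN.search(line)
            if m:
                losses.append(float(m.group(1)))
    plt.plot(losses)
    plt.xlabel("iteration")
    plt.ylabel("training loss")
    plt.savefig(out_path)

plot_loss("train.log")
```

A spike or plateau right after the point where you resumed training would be a strong hint that the instability comes from the restart rather than from the model itself.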
Thank you very much for your answer; I will analyze the problem from the perspective of the loss curve. Is it possible that interrupting training and then resuming with the 'resume' parameter could lead to unstable convergence? And do you think training on a single 48GB GPU will affect the training results?
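For reference, my resume step is roughly equivalent to the sketch below (a hypothetical checkpoint layout, not necessarily the repo's actual 'resume' logic):

```python
import torch

def resume(ckpt_path, model, optimizer, scheduler):
    # Assumed checkpoint keys; the real script may use different names.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    # If only the model weights are restored and the optimizer/scheduler
    # states are reinitialized, the LR schedule restarts from scratch and
    # convergence can look unstable after resuming.
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    start_epoch = ckpt["epoch"] + 1
    return start_epoch
```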
Hello author, thanks for your excellent work! I have a few questions about reproducing the results. I retrained for 16 epochs on a single 48GB GPU without distributed training, and obtained acc: 0.4411, comp: 0.4156, overall: 0.4284, which falls short of the expected metrics. Is this due to training on a single GPU, or could there be other reasons? I would greatly appreciate your answers to my questions!
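If the gap is caused by a smaller effective batch size on one GPU, one thing I could try is gradient accumulation to imitate the multi-GPU setup. A minimal sketch with a toy model, assuming the original setup used several GPUs (the counts and names below are placeholders, not the repo's actual config):

```python
import torch
from torch import nn

N_GPUS = 4            # GPUs assumed in the original multi-GPU setup
ACCUM_STEPS = N_GPUS  # accumulate gradients to match that effective batch size
model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(16)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / ACCUM_STEPS).backward()      # average gradients over the window
    if (step + 1) % ACCUM_STEPS == 0:    # one update per ACCUM_STEPS batches
        optimizer.step()
        optimizer.zero_grad()
```

Would matching the effective batch size this way (and keeping the learning rate consistent with it) be expected to close the gap, or do you think the difference comes from something else?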