Closed guanchzh closed 2 years ago
This is because the learning rate you are using is for a batch size of 1024 while the batch size in your experiments is small.
Effective Batch size = Batch size per GPU x Number of GPUs
To adjust the learning rate, you can follow the linear scaling rule:
Learning rate = base learning rate x (effective batch size / 1024)
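A minimal sketch of the scaling, assuming the config's base learning rate was tuned for an effective batch size of 1024 (the function name and numbers here are illustrative, not part of the repo):

```python
def scaled_lr(base_lr, batch_size_per_gpu, num_gpus, base_batch_size=1024):
    # Effective batch size = batch size per GPU x number of GPUs
    effective_batch_size = batch_size_per_gpu * num_gpus
    # Linear scaling rule: LR grows/shrinks proportionally to batch size
    return base_lr * effective_batch_size / base_batch_size

# e.g. a base LR of 0.4 (tuned for 1024), run on 2 GPUs with 128 images each:
print(scaled_lr(0.4, 128, 2))  # 0.1
```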
Hope this helps!
No activity, so closing it
I'm trying to train the ResNet classification model on my machine using 2 TITAN RTX GPUs. I left all values in the config file at their defaults except the dataset path and the log_freq. After training for several iterations or several epochs, an exception was thrown, as below:

```
2022-02-07 14:21:38 - LOGS - Epoch: 5 [ 16462/10000000], loss: 1.0149, top1: 100.0000, top5: 100.0000, LR: [0.398905, 0.398905], Avg. batch load time: 2.070, Elapsed time: 3878.27
2022-02-07 14:24:18 - LOGS - Epoch: 5 [ 16512/10000000], loss: 1.0149, top1: 100.0000, top5: 100.0000, LR: [0.398905, 0.398905], Avg. batch load time: 2.054, Elapsed time: 4038.14
2022-02-07 14:26:52 - LOGS - Epoch: 5 [ 16562/10000000], loss: 1.0332, top1: 99.9978, top5: 99.9978, LR: [0.398905, 0.398905], Avg. batch load time: 2.071, Elapsed time: 4191.72
--Call--
2022-02-07 14:27:04 - LOGS - Training took 01:10:05.55
```
This problem was encountered repeatedly when I resumed training many times. Strangely, it was not encountered while training MobileViT. Could you tell me how to solve this problem?