apple / ml-cvnets

CVNets: A library for training computer vision networks
https://apple.github.io/ml-cvnets

Exception caught while training ResNet? #9

Closed guanchzh closed 2 years ago

guanchzh commented 2 years ago

I'm trying to train the ResNet classification model on my machine using 2 TITAN RTX GPUs. I left every setting in the config file at its default value except the dataset path and log_freq. After training for several iterations or several epochs, an exception was thrown, as shown below:

`2022-02-07 14:21:38 - LOGS - Epoch: 5 [ 16462/10000000], loss: 1.0149, top1: 100.0000, top5: 100.0000, LR: [0.398905, 0.398905], Avg. batch load time: 2.070, Elapsed time: 3878.27
2022-02-07 14:24:18 - LOGS - Epoch: 5 [ 16512/10000000], loss: 1.0149, top1: 100.0000, top5: 100.0000, LR: [0.398905, 0.398905], Avg. batch load time: 2.054, Elapsed time: 4038.14
2022-02-07 14:26:52 - LOGS - Epoch: 5 [ 16562/10000000], loss: 1.0332, top1: 99.9978, top5: 99.9978, LR: [0.398905, 0.398905], Avg. batch load time: 2.071, Elapsed time: 4191.72
--Call--
/home/guan/miniconda3/envs/deit/lib/python3.6/site-packages/torch/autocast_mode.py(179)__exit__()
-> def __exit__(self, *args):
(Pdb) 2022-02-07 14:27:03 - LOGS - Exception occurred that interrupted the training.
2022-02-07 14:27:04 - LOGS - Training took 01:10:05.55`

I ran into this problem repeatedly, every time I resumed training. Strangely, it did not occur while training MobileViT. Could you tell me how to solve it?

sacmehta commented 2 years ago

This is because the learning rate you are using is meant for an effective batch size of 1024, while the effective batch size in your experiments is smaller.

Effective Batch size = Batch size per GPU x Number of GPUs

To adjust the learning rate, you can follow the linear scaling rule.

Learning rate = Base learning rate x (Effective batch size / 1024)
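For anyone who wants to work out the numbers, here is a minimal sketch of the linear scaling rule as it is commonly stated: the learning rate is scaled in proportion to the effective batch size, relative to the batch size the base learning rate was tuned for. The function name `scale_learning_rate` is hypothetical and not part of CVNets. The base learning rate of 0.4 (roughly what the log above shows) and the per-GPU batch size of 64 are assumptions for illustration only; substitute the values from your own config.

```python
def scale_learning_rate(base_lr, batch_size_per_gpu, num_gpus,
                        reference_batch_size=1024):
    """Scale a learning rate linearly with the effective batch size.

    base_lr is assumed to have been tuned for reference_batch_size.
    This is a hypothetical helper, not a CVNets function.
    """
    if base_lr <= 0 or batch_size_per_gpu <= 0 or num_gpus <= 0:
        raise ValueError("base_lr, batch_size_per_gpu and num_gpus must be positive")
    if reference_batch_size <= 0:
        raise ValueError("reference_batch_size must be positive")

    # Effective batch size = batch size per GPU x number of GPUs
    effective_batch_size = batch_size_per_gpu * num_gpus

    # Linear scaling: a smaller effective batch gets a proportionally smaller LR
    return base_lr * effective_batch_size / reference_batch_size


if __name__ == "__main__":
    # Assumed values: base LR 0.4 tuned for 1024, 2 GPUs with 64 images each
    lr = scale_learning_rate(base_lr=0.4, batch_size_per_gpu=64, num_gpus=2)
    print(f"Scaled learning rate: {lr}")
```

With these assumed values the effective batch size is 128, so the learning rate drops to one eighth of the base value. Any warm-up or minimum learning rate in the scheduler config would probably need the same scaling.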

Hope this helps!

sacmehta commented 2 years ago

No activity, so closing it