badripatro / simba

Always getting NaN loss when token_label is False #6

Open · lqniunjunlper opened this issue 3 months ago

lqniunjunlper commented 3 months ago

```
[03/29 11:44:29] train INFO: Test: [520/521] eta: 0:00:00 loss: 6.7130 (6.8320) acc1: 0.0000 (0.5440) acc5: 0.0000 (2.0020) time: 0.4453 data: 0.0002 max mem: 15878
[03/29 11:44:29] train INFO: Test: Total time: 0:04:02 (0.4651 s / it)
[03/29 11:44:35] train INFO: * Acc@1 0.544 Acc@5 2.002 loss 6.832
[03/29 11:44:37] train INFO: *** Best metric: 0.5440000167846679 (epoch 1)
[03/29 11:44:37] train INFO: Accuracy of the network on the 50000 test images: 0.5%
[03/29 11:44:37] train INFO: Max accuracy: 0.54%
[03/29 11:44:37] train INFO: {'train_lr': 9.999999999999953e-07, 'train_loss': 6.907856970549011, 'test_loss': 6.832010622845959, 'test_acc1': 0.5440000167846679, 'test_acc5': 2.0021250657749174, 'epoch': 1, 'n_parameters': 66249512}
[03/29 11:44:44] train INFO: Epoch: [2] [ 0/1251] eta: 2:18:16 lr: 0.000401 loss: 6.9199 (6.9199) time: 6.6321 data: 4.9419 max mem: 15878
[03/29 11:45:01] train INFO: Epoch: [2] [ 10/1251] eta: 0:44:41 lr: 0.000401 loss: 6.9639 (6.9749) time: 2.1604 data: 0.4495 max mem: 15878
[03/29 11:45:18] train INFO: Epoch: [2] [ 20/1251] eta: 0:39:54 lr: 0.000401 loss: 6.9639 (6.9806) time: 1.7107 data: 0.0003 max mem: 15878
[03/29 11:45:35] train INFO: Epoch: [2] [ 30/1251] eta: 0:38:00 lr: 0.000401 loss: 7.0065 (6.9880) time: 1.7071 data: 0.0002 max mem: 15878
[03/29 11:45:49] train INFO: Loss is nan, stopping training
```
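
For context, the "Loss is nan, stopping training" message at the end of this log comes from the standard NaN guard in DeiT-style training engines, which this codebase appears to follow. A minimal sketch of that guard (illustrative names, not the repo's exact code):

```python
import math
import sys


def train_one_step(model, samples, targets, criterion, optimizer):
    """One optimization step with a DeiT-style NaN guard (sketch)."""
    outputs = model(samples)
    loss = criterion(outputs, targets)

    loss_value = loss.item()
    # If the loss is NaN or inf, abort immediately rather than letting
    # a bad update corrupt the model weights.
    if not math.isfinite(loss_value):
        print(f"Loss is {loss_value}, stopping training")
        sys.exit(1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_value
```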

Ale0311 commented 3 months ago

Hello! Where did you download the token labels (imagenet_efficientnet_l2_sz475top5) from? Thanks!

Ale0311 commented 3 months ago

Are you training on multiple GPUs? If you managed to get the checkpoint, can you please share it with me? Thanks!

EasonXiao-888 commented 1 month ago

@lqniunjunlper Hello, I have also encountered this problem and would like to know how you solved it.

lqniunjunlper commented 1 month ago

@EasonXiao-888 When token_label is set to False, the loss is always NaN. I later downloaded the token-label dataset, but training is still unstable: midway through training, the loss suddenly increases. [screenshot: loss curve] I have tried different training setups, including different model sizes and batch sizes, and most of the runs failed. I only got one successful run; part of its log follows: [screenshot: training log]
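
For anyone hitting the same mid-training loss spikes: gradient clipping is a common mitigation for this kind of instability. A minimal self-contained sketch in plain PyTorch; the model, optimizer, and max_norm=1.0 below are stand-ins, not the repo's actual settings:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the SiMBA model
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

x = torch.randn(8, 10)
loss = model(x).pow(2).mean()  # stand-in loss

optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm before stepping to damp sudden spikes;
# max_norm=1.0 is illustrative, not a tuned value for this repo.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```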

EasonXiao-888 commented 1 month ago

Thank you very much for your reply. I would like to know whether the results reported in the Simba paper use token_label.
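
For reference, token labeling (introduced by LV-ViT, which is where the token-label data discussed above comes from) adds a dense per-patch soft-label loss on top of the usual class-token cross-entropy. A conceptual sketch of that objective, not SiMBA's actual loss code; the 0.5 weight on the token term is an assumption:

```python
import torch.nn.functional as F


def token_labeling_loss(cls_logits, token_logits, cls_target, token_targets,
                        token_weight=0.5):
    """Conceptual token-labeling objective (after LV-ViT).

    cls_logits:    (B, C) class-token predictions
    token_logits:  (B, N, C) per-patch predictions
    cls_target:    (B,) hard image-level labels
    token_targets: (B, N, C) dense soft labels from a pretrained annotator
                   (e.g. the EfficientNet-L2 maps asked about above)
    """
    cls_loss = F.cross_entropy(cls_logits, cls_target)
    # Soft cross-entropy between each patch token and its dense soft label.
    token_loss = -(token_targets * F.log_softmax(token_logits, dim=-1)).sum(-1).mean()
    return cls_loss + token_weight * token_loss
```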