G-U-N / PyCIL

PyCIL: A Python Toolbox for Class-Incremental Learning

Loss is NaN when running BEEF #64

Closed: JACK-Chen-2019 closed this issue 7 months ago

JACK-Chen-2019 commented 1 year ago

When I run BEEF with the configs below (b0 20 steps and b0 10 steps) on imagenet100, the loss is NaN, but b0 5 steps runs normally. Does BEEF need special parameters for different settings? The BEEF imagenet100 config comes from the official BEEF project; I only changed the step settings and "fixed_memory".

{ "prefix": "fusion-energy-0.01-1.7-fixed", "dataset": "imagenet100", "memory_size": 2000, "memory_per_class": 20, "fixed_memory": false, "shuffle": true, "init_cls": 10, "increment": 10, "model_name": "beefiso", "convnet_type": "resnet18", "device": ["0"], "seed": [1993], "logits_alignment": 3, "energy_weight": 0.1, "is_compress":false, "init_epochs": 200, "init_lr" : 0.1, "init_weight_decay" : 5e-4, "expansion_epochs" : 120, "fusion_epochs" : 60, "lr" : 0.1, "batch_size" : 128, "weight_decay" : 5e-4, "num_workers" : 12, "reduce_batch_size": true, "T" : 2 }

2023-10-03 10:32:05,959 [trainer.py] => config: ./exps/beef-imagenet100-b0-step10.json
2023-10-03 10:32:05,959 [trainer.py] => prefix: fusion-energy-0.01-1.7-fixed
2023-10-03 10:32:05,959 [trainer.py] => dataset: imagenet100
2023-10-03 10:32:05,959 [trainer.py] => memory_size: 2000
2023-10-03 10:32:05,959 [trainer.py] => memory_per_class: 20
2023-10-03 10:32:05,959 [trainer.py] => fixed_memory: False
2023-10-03 10:32:05,959 [trainer.py] => shuffle: True
2023-10-03 10:32:05,959 [trainer.py] => init_cls: 10
2023-10-03 10:32:05,959 [trainer.py] => increment: 10
2023-10-03 10:32:05,959 [trainer.py] => model_name: beefiso
2023-10-03 10:32:05,959 [trainer.py] => convnet_type: resnet18
2023-10-03 10:32:05,959 [trainer.py] => device: [device(type='cuda', index=0)]
2023-10-03 10:32:05,959 [trainer.py] => seed: 1993
2023-10-03 10:32:05,959 [trainer.py] => logits_alignment: 3
2023-10-03 10:32:05,959 [trainer.py] => energy_weight: 0.1
2023-10-03 10:32:05,959 [trainer.py] => is_compress: False
2023-10-03 10:32:05,959 [trainer.py] => init_epochs: 200
2023-10-03 10:32:05,959 [trainer.py] => init_lr: 0.1
2023-10-03 10:32:05,959 [trainer.py] => init_weight_decay: 0.0005
2023-10-03 10:32:05,959 [trainer.py] => expansion_epochs: 120
2023-10-03 10:32:05,959 [trainer.py] => fusion_epochs: 60
2023-10-03 10:32:05,959 [trainer.py] => lr: 0.1
2023-10-03 10:32:05,959 [trainer.py] => batch_size: 128
2023-10-03 10:32:05,959 [trainer.py] => weight_decay: 0.0005
2023-10-03 10:32:05,959 [trainer.py] => num_workers: 12
2023-10-03 10:32:05,959 [trainer.py] => reduce_batch_size: True
2023-10-03 10:32:05,959 [trainer.py] => T: 2
2023-10-03 10:32:06,251 [data_manager.py] => [68, 56, 78, 8, 23, 84, 90, 65, 74, 76, 40, 89, 3, 92, 55, 9, 26, 80, 43, 38, 58, 70, 77, 1, 85, 19, 17, 50, 28, 53, 13, 81, 45, 82, 6, 59, 83, 16, 15, 44, 91, 41, 72, 60, 79, 52, 20, 10, 31, 54, 37, 95, 14, 71, 96, 98, 97, 2, 64, 66, 42, 22, 35, 86, 24, 34, 87, 21, 99, 0, 88, 27, 18, 94, 11, 12, 47, 25, 30, 46, 62, 69, 36, 61, 7, 63, 75, 5, 32, 4, 51, 48, 73, 93, 39, 67, 29, 49, 57, 33]
2023-10-03 10:32:06,446 [trainer.py] => All params: 0
2023-10-03 10:32:06,446 [trainer.py] => Trainable params: 0
2023-10-03 10:32:06,598 [beef_iso.py] => Learning on 0-10
2023-10-03 10:32:06,599 [beef_iso.py] => All params: 11186772
2023-10-03 10:32:06,599 [beef_iso.py] => Trainable params: 11186772
2023-10-03 10:32:51,543 [beef_iso.py] => Task 0, Epoch 1/200 => Loss nan, Loss_en nan, Train_accy 10.34, Test_accy 10.00
2023-10-03 10:33:33,308 [beef_iso.py] => Task 0, Epoch 2/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:34:15,339 [beef_iso.py] => Task 0, Epoch 3/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:34:57,416 [beef_iso.py] => Task 0, Epoch 4/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:35:39,575 [beef_iso.py] => Task 0, Epoch 5/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:36:22,981 [beef_iso.py] => Task 0, Epoch 6/200 => Loss nan, Loss_en nan, Train_accy 10.04, Test_accy 10.00
2023-10-03 10:37:04,801 [beef_iso.py] => Task 0, Epoch 7/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:37:46,740 [beef_iso.py] => Task 0, Epoch 8/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:38:28,795 [beef_iso.py] => Task 0, Epoch 9/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:39:10,817 [beef_iso.py] => Task 0, Epoch 10/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:39:54,286 [beef_iso.py] => Task 0, Epoch 11/200 => Loss nan, Loss_en nan, Train_accy 10.04, Test_accy 10.00
2023-10-03 10:40:36,189 [beef_iso.py] => Task 0, Epoch 12/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:41:18,549 [beef_iso.py] => Task 0, Epoch 13/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:42:00,521 [beef_iso.py] => Task 0, Epoch 14/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:42:42,680 [beef_iso.py] => Task 0, Epoch 15/200 => Loss nan, Loss_en nan, Train_accy 10.04
2023-10-03 10:43:26,211 [beef_iso.py] => Task 0, Epoch 16/200 => Loss nan, Loss_en nan, Train_accy 10.04, Test_accy 10.00
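
Since Loss and Loss_en are already NaN at epoch 1, the divergence happens during the very first updates. One quick way to localize it is to run briefly with PyTorch's anomaly detection and assert that each loss term is finite before backprop. This is a minimal debugging sketch, not PyCIL's actual training loop; `loss_clf` and `loss_energy` are placeholder names for BEEF's classification and energy terms:

```python
import torch

# Make backward() raise at the op that first produces NaN/Inf
# (slow; enable only while debugging).
torch.autograd.set_detect_anomaly(True)

def check_finite(name, value):
    # Fail fast with a pointer to the offending term instead of
    # silently logging "Loss nan" for 200 epochs.
    if not torch.isfinite(value).all():
        raise RuntimeError(f"{name} is non-finite: {value}")

# Inside the training loop (placeholder names, not PyCIL's code):
#   check_finite("loss_clf", loss_clf)
#   check_finite("loss_energy", loss_energy)
#   loss = loss_clf + loss_energy
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()
```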

chenxiang3luo commented 9 months ago

Hi, I met the same problem. In this codebase, the resnet18 used for imagenet100 is larger than the resnet32 adapted for cifar100. You may need to make the lr smaller, e.g. 0.01, which works for me. Have a good day!
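
Concretely, that means setting "init_lr": 0.01 and "lr": 0.01 in the config above. If you still want to reach the full rate of 0.1, a short warmup over the first few epochs is a common alternative for early-training NaNs. Below is a minimal sketch using PyTorch's built-in LinearLR scheduler; the model and loop are illustrative stand-ins, not PyCIL's trainer:

```python
import torch

model = torch.nn.Linear(512, 100)  # stand-in for the actual resnet18
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

# Ramp the lr linearly from 0.01 up to 0.1 over the first 5 epochs,
# which often avoids early divergence without giving up the final rate.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer,
                                           start_factor=0.1,
                                           total_iters=5)

for epoch in range(200):
    # ... one training epoch over the current task's loader ...
    warmup.step()
```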

caoshuai888 commented 9 months ago

(Translated from Chinese) This is an automated vacation reply from QQ Mail. Hello, I am currently on vacation and unable to reply to your email personally. I will get back to you as soon as possible after the vacation ends.