Closed: Kishaan closed this issue 2 years ago
On a different note, I'm hitting an issue where, after training the first task (again with the same setting as above and 2 epochs per task), I get this error in the bce_with_logits function.
/volumes1/Home/anaconda3/envs/timm-env/bin/python /volumes2/Other/dytox/main.py --options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path logs/ --output-basedir outputs/ --patch-size 4 --epochs 2 --base-epochs 2
Not using distributed mode
Namespace(aa='rand-m9-mstd0.5-inc1', auto_kd=True, base_epochs=2, batch_size=128, bce_loss=True, class_attention=True, class_order=[87, 0, 52, 58, 44, 91, 68, 97, 51, 15, 94, 92, 10, 72, 49, 78, 61, 14, 8, 86, 84, 96, 18, 24, 32, 45, 88, 11, 4, 67, 69, 66, 77, 47, 79, 93, 29, 50, 57, 83, 17, 81, 41, 12, 37, 59, 25, 20, 80, 73, 1, 28, 6, 46, 62, 82, 53, 9, 31, 75, 38, 63, 33, 74, 27, 22, 36, 3, 16, 21, 60, 19, 70, 90, 89, 43, 5, 42, 65, 76, 40, 30, 23, 85, 2, 95, 56, 48, 71, 64, 98, 13, 99, 7, 34, 55, 54, 26, 35, 39], clip_grad=None, color_jitter=0.4, cooldown_epochs=10, cutmix=0.0, cutmix_minmax=None, data_path='logs/', data_set='CIFAR', debug=False, decay_epochs=30, decay_rate=0.1, depth=6, device='cuda', dist_eval=False, dist_url='env://', distillation_tau=1.0, distributed=False, drop=0.0, drop_path=0.1, dytox=True, embed_dim=384, epochs=2, eval=False, eval_every=50, finetuning='balanced', finetuning_epochs=20, finetuning_lr=5e-05, finetuning_resetclf=False, finetuning_teacher=False, fixed_memory=False, freeze_eval=False, freeze_ft=['sab'], freeze_task=['old_task_tokens', 'old_heads'], head_div=0.1, head_div_mode='tr', inat_category='name', increment=10, incremental_batch_size=128, incremental_lr=0.0005, incremental_warmup_lr=None, ind_clf='1-1', initial_increment=10, input_size=32, joint_tokens=False, local_rank=None, local_up_to_layer=5, locality_strength=1.0, log_category='10-10', log_dir='logs/cifar/10-10/22-03/week-4/25_dytox', log_path='logs', look_sam_alpha=0.7, look_sam_k=0, lr=0.0005, lr_noise=None, lr_noise_pct=0.67, lr_noise_std=1.0, max_task=None, memory_size=2000, min_lr=1e-05, mixup=0.0, mixup_mode='batch', mixup_prob=1.0, mixup_switch_prob=0.5, model='convit', momentum=0.9, name='dytox', no_amp=True, norm='layer', num_heads=12, num_workers=0, only_ft=False, opt='adamw', opt_betas=None, opt_eps=1e-08, options=['options/data/cifar100_10-10.yaml', 'options/data/cifar100_order1.yaml', 'options/model/cifar_dytox.yaml'], output_basedir='outputs/', output_dir='', patch_size=4, patience_epochs=10, pin_mem=True, recount=1, rehearsal='icarl_all', remode='pixel', repeated_aug=True, replay_memory=0, reprob=0.0, resplit=False, resume='', sam_adaptive=False, sam_div='', sam_final=None, sam_first='main', sam_mode=['tr', 'ft'], sam_rho=0.0, sam_second='main', sam_skip_first=False, save_every_epoch=None, sched='cosine', seed=0, sep_memory=False, smoothing=0.1, start_epoch=0, start_task=0, train_interpolation='bicubic', trial_id=1, validation=0.0, warmup_epochs=5, warmup_lr=1e-06, weight_decay=1e-06, world_size=1)
Files already downloaded and verified
Files already downloaded and verified
Creating model: convit
kdytox\
number of params: 10689334
Starting task id 0/9
Creating DyTox!
Adding new parameters
Start training for 2 epochs
Image size is torch.Size([128, 3, 32, 32]).
Task: [0] Epoch: [0] [ 0/39] eta: 0:00:20 lr: 0.000001 loss: 0.6905 (0.6905) time: 0.5381 data: 0.0460 max mem: 1854
Task: [0] Epoch: [0] [10/39] eta: 0:00:04 lr: 0.000001 loss: 0.6784 (0.6763) time: 0.1629 data: 0.0450 max mem: 1982
Task: [0] Epoch: [0] [20/39] eta: 0:00:02 lr: 0.000001 loss: 0.6524 (0.6549) time: 0.1247 data: 0.0444 max mem: 1982
Task: [0] Epoch: [0] [30/39] eta: 0:00:01 lr: 0.000001 loss: 0.6127 (0.6364) time: 0.1240 data: 0.0442 max mem: 1982
Task: [0] Epoch: [0] [38/39] eta: 0:00:00 lr: 0.000001 loss: 0.5865 (0.6195) time: 0.1251 data: 0.0455 max mem: 1982
Task: [0] Epoch: [0] Total time: 0:00:05 (0.1355 s / it)
Averaged stats: lr: 0.000001 loss: 0.5865 (0.6195)
Test: [0/6] eta: 0:00:00 loss: 2.2770 (2.2770) acc1: 14.0625 (14.0625) acc5: 59.3750 (59.3750) time: 0.1299 data: 0.0595 max mem: 1982
Test: [5/6] eta: 0:00:00 loss: 2.2770 (2.2788) acc1: 15.6250 (15.9000) acc5: 59.3750 (58.5000) time: 0.0570 data: 0.0255 max mem: 1982
Test: Total time: 0:00:00 (0.0571 s / it)
* Acc@1 15.900 loss 2.279
Accuracy of the network on the 1000 test images: 15.9%
Max accuracy: 15.90%
Image size is torch.Size([128, 3, 32, 32]).
Task: [0] Epoch: [1] [ 0/39] eta: 0:00:04 lr: 0.000001 loss: 0.5388 (0.5388) time: 0.1256 data: 0.0424 max mem: 1982
Task: [0] Epoch: [1] [10/39] eta: 0:00:03 lr: 0.000001 loss: 0.5310 (0.5286) time: 0.1236 data: 0.0427 max mem: 1982
Task: [0] Epoch: [1] [20/39] eta: 0:00:02 lr: 0.000001 loss: 0.5139 (0.5189) time: 0.1225 data: 0.0424 max mem: 1982
Task: [0] Epoch: [1] [30/39] eta: 0:00:01 lr: 0.000001 loss: 0.4955 (0.5078) time: 0.1230 data: 0.0434 max mem: 1982
Task: [0] Epoch: [1] [38/39] eta: 0:00:00 lr: 0.000001 loss: 0.4796 (0.5000) time: 0.1236 data: 0.0438 max mem: 1982
Task: [0] Epoch: [1] Total time: 0:00:04 (0.1233 s / it)
Averaged stats: lr: 0.000001 loss: 0.4796 (0.5000)
Test: [0/6] eta: 0:00:00 loss: 2.2385 (2.2385) acc1: 18.2292 (18.2292) acc5: 65.6250 (65.6250) time: 0.0507 data: 0.0262 max mem: 1982
Test: [5/6] eta: 0:00:00 loss: 2.2265 (2.2373) acc1: 18.2292 (18.1000) acc5: 62.5000 (63.2000) time: 0.0432 data: 0.0218 max mem: 1982
Test: Total time: 0:00:00 (0.0433 s / it)
* Acc@1 18.100 loss 2.237
Accuracy of the network on the 1000 test images: 18.1%
Max accuracy: 18.10%
Test: [0/6] eta: 0:00:00 loss: 2.2385 (2.2385) acc1: 18.2292 (18.2292) acc5: 65.6250 (65.6250) time: 0.0794 data: 0.0548 max mem: 1982
Test: [5/6] eta: 0:00:00 loss: 2.2265 (2.2373) acc1: 18.2292 (18.1000) acc5: 62.5000 (63.2000) time: 0.0465 data: 0.0249 max mem: 1982
Test: Total time: 0:00:00 (0.0466 s / it)
* Acc@1 18.100 loss 2.237
Accuracy of the network on the 1000 test images: 18.1%
Max accuracy: 18.10%
Starting task id 1/9
2000 samples added from memory.
Updating ensemble, new embed dim 384.
Adding new parameters
Start training for 2 epochs
Image size is torch.Size([128, 3, 32, 32]).
Traceback (most recent call last):
  File "/volumes2/Other/dytox/main.py", line 733, in <module>
    main(args)
  File "/volumes2/Other/dytox/main.py", line 540, in main
    train_stats = train_one_epoch(
  File "/volumes2/Other/dytox/continual/engine.py", line 60, in train_one_epoch
    loss_tuple = forward(samples, targets, model, teacher_model, criterion, lam, args)
  File "/volumes2/Other/dytox/continual/engine.py", line 162, in forward
    loss = criterion(main_output, targets)  # bce_with_logits
  File "/volumes2/Other/dytox/continual/losses.py", line 71, in bce_with_logits
    torch.eye(x.shape[1])[y].to(y.device)
IndexError: index 12 is out of bounds for dimension 0 with size 10

Process finished with exit code 1
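If it helps, here is a minimal sketch of how I understand the failure (illustrative names and shapes, not the repo's exact code): bce_with_logits one-hot encodes the targets with torch.eye over the number of logits, so any label index that is >= the current head size (here 10, while a memory/new-task sample carries label 12) raises exactly this IndexError.

```python
# Minimal sketch of the failure mode; names and shapes are illustrative.
import torch
import torch.nn.functional as F

def bce_with_logits_sketch(x, y):
    # x: [batch, num_logits] raw logits, y: [batch] integer class labels
    one_hot = torch.eye(x.shape[1])[y].to(y.device)  # fails when y.max() >= x.shape[1]
    return F.binary_cross_entropy_with_logits(x, one_hot)

logits = torch.randn(4, 10)               # head still sized for the 10 classes of task 0
targets = torch.tensor([3, 7, 12, 1])     # label 12 belongs to task 1's classes
# bce_with_logits_sketch(logits, targets) # -> IndexError: index 12 is out of bounds for dimension 0 with size 10
```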
Hey,
Thanks for your interest in my work :)
Both of your errors come from the fact that you didn't use the train.sh script, as advised in the README, and thus were not running in distributed mode (which also works very well on a single GPU). I've made a few modifications to the code base so that it can also be used directly with main.py.
Don't hesitate to read the README if you run into errors; they are probably answered there. If not, please open an issue :)
Hi,
I'm running your code for CIFAR-100 with the ConVit backbone (as suggested in the README file), and I'm running into this error when the rehearsal memory is being updated.
I'm training for 2 epochs per task (just to understand the structure of the code), and these are the arguments I'm using:
--options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path logs/ --output-basedir outputs/ --patch-size 4 --epochs 2 --base-epochs 2
I also noticed that the training and validation loops always use the classification head inside ConVit and never the ContinualClassifier inside dytox.py. Is that expected?
After the first task, ConVit's classifier weights have changed (compared to the initialized weights), but the DyTox module's ContinualClassifier still has its initial weights, and those unchanged weights are then frozen before the second task. I was expecting ConVit's weights to be copied into the ContinualClassifier after every task. In short, I would like to know how you save the updated classifier weights of the previous task before moving on to the next one.
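For reference, here is a minimal, self-contained toy I used to reason about this (the module names below are illustrative stand-ins, not the repo's actual classes): the head that sits on the forward path gets updated, while the unused head keeps its initial weights, which matches what I observed for ConVit's classifier vs. the ContinualClassifier.

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in: one head that is trained and one that never appears in the forward pass.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.convit_head = nn.Linear(8, 10)      # stands in for ConVit's classification head
        self.continual_head = nn.Linear(8, 10)   # stands in for DyTox's ContinualClassifier
    def forward(self, x):
        return self.convit_head(x)               # only this head is used

model = Toy()
before = copy.deepcopy(model.state_dict())

# One tiny "training" step: only parameters on the forward path receive gradients.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(4, 8)).sum()
loss.backward()
opt.step()

after = model.state_dict()
for name in before:
    changed = not torch.equal(before[name], after[name])
    print(f"{name}: {'updated' if changed else 'unchanged'}")
# convit_head.* prints as updated, continual_head.* as unchanged,
# which is the same pattern I saw between the two classifiers after task 0.
```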
Any clarification regarding this would be very helpful! Thank you!