arthurdouillard / dytox

Dynamic Token Expansion with Continual Transformers, accepted at CVPR 2022
https://arxiv.org/abs/2111.11326
Apache License 2.0
134 stars 17 forks

convit object has no attribute 'module' #1

Closed: Kishaan closed this issue 2 years ago

Kishaan commented 2 years ago

Hi,

I'm running your code for CIFAR-100 with the ConVit backbone (as suggested in the README file). I'm running into this error when the rehearsal memory is being updated.

/volumes1/Home/anaconda3/envs/timm-env/bin/python /volumes2/Other/dytox/main.py --options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path logs/ --output-basedir outputs/ --patch-size 4 --epochs 2 --base-epochs 2
Not using distributed mode
Namespace(aa='rand-m9-mstd0.5-inc1', auto_kd=True, base_epochs=2, batch_size=128, bce_loss=True, class_attention=True, class_order=[87, 0, 52, 58, 44, 91, 68, 97, 51, 15, 94, 92, 10, 72, 49, 78, 61, 14, 8, 86, 84, 96, 18, 24, 32, 45, 88, 11, 4, 67, 69, 66, 77, 47, 79, 93, 29, 50, 57, 83, 17, 81, 41, 12, 37, 59, 25, 20, 80, 73, 1, 28, 6, 46, 62, 82, 53, 9, 31, 75, 38, 63, 33, 74, 27, 22, 36, 3, 16, 21, 60, 19, 70, 90, 89, 43, 5, 42, 65, 76, 40, 30, 23, 85, 2, 95, 56, 48, 71, 64, 98, 13, 99, 7, 34, 55, 54, 26, 35, 39], clip_grad=None, color_jitter=0.4, cooldown_epochs=10, cutmix=0.0, cutmix_minmax=None, data_path='logs/', data_set='CIFAR', debug=False, decay_epochs=30, decay_rate=0.1, depth=6, device='cuda', dist_eval=False, dist_url='env://', distillation_tau=1.0, distributed=False, drop=0.0, drop_path=0.1, dytox=True, embed_dim=384, epochs=2, eval=False, eval_every=50, finetuning='balanced', finetuning_epochs=20, finetuning_lr=5e-05, finetuning_resetclf=False, finetuning_teacher=False, fixed_memory=False, freeze_eval=False, freeze_ft=['sab'], freeze_task=['old_task_tokens', 'old_heads'], head_div=0.1, head_div_mode='tr', inat_category='name', increment=10, incremental_batch_size=128, incremental_lr=0.0005, incremental_warmup_lr=None, ind_clf='1-1', initial_increment=10, input_size=32, joint_tokens=False, local_rank=None, local_up_to_layer=5, locality_strength=1.0, log_category='10-10', log_dir='logs/cifar/10-10/22-03/week-4/25_dytox', log_path='logs', look_sam_alpha=0.7, look_sam_k=0, lr=0.0005, lr_noise=None, lr_noise_pct=0.67, lr_noise_std=1.0, max_task=None, memory_size=2000, min_lr=1e-05, mixup=0.0, mixup_mode='batch', mixup_prob=1.0, mixup_switch_prob=0.5, model='convit', momentum=0.9, name='dytox', no_amp=True, norm='layer', num_heads=12, num_workers=0, only_ft=False, opt='adamw', opt_betas=None, opt_eps=1e-08, options=['options/data/cifar100_10-10.yaml', 'options/data/cifar100_order1.yaml', 'options/model/cifar_dytox.yaml'], output_basedir='outputs/', output_dir='', patch_size=4, patience_epochs=10, pin_mem=True, recount=1, rehearsal='icarl_all', remode='pixel', repeated_aug=True, replay_memory=0, reprob=0.0, resplit=False, resume='', sam_adaptive=False, sam_div='', sam_final=None, sam_first='main', sam_mode=['tr', 'ft'], sam_rho=0.0, sam_second='main', sam_skip_first=False, save_every_epoch=None, sched='cosine', seed=0, sep_memory=False, smoothing=0.1, start_epoch=0, start_task=0, train_interpolation='bicubic', trial_id=1, validation=0.0, warmup_epochs=5, warmup_lr=1e-06, weight_decay=1e-06, world_size=1)
Files already downloaded and verified
Files already downloaded and verified
Creating model: convit
kdytox\
number of params: 10689334
Starting task id 0/9
Creating DyTox!
Adding new parameters
Start training for 2 epochs
Image size is torch.Size([128, 3, 32, 32]).
Task: [0] Epoch: [0]  [ 0/39]  eta: 0:00:20  lr: 0.000001  loss: 0.6984 (0.6984)  time: 0.5359  data: 0.0451  max mem: 1854
Task: [0] Epoch: [0]  [10/39]  eta: 0:00:04  lr: 0.000001  loss: 0.6794 (0.6781)  time: 0.1599  data: 0.0444  max mem: 1982
Task: [0] Epoch: [0]  [20/39]  eta: 0:00:02  lr: 0.000001  loss: 0.6494 (0.6559)  time: 0.1219  data: 0.0436  max mem: 1982
Task: [0] Epoch: [0]  [30/39]  eta: 0:00:01  lr: 0.000001  loss: 0.6122 (0.6364)  time: 0.1215  data: 0.0430  max mem: 1982
Task: [0] Epoch: [0]  [38/39]  eta: 0:00:00  lr: 0.000001  loss: 0.5855 (0.6199)  time: 0.1216  data: 0.0432  max mem: 1982
Task: [0] Epoch: [0] Total time: 0:00:05 (0.1325 s / it)
Averaged stats: lr: 0.000001  loss: 0.5855 (0.6199)
Test:  [0/6]  eta: 0:00:00  loss: 2.2756 (2.2756)  acc1: 13.5417 (13.5417)  acc5: 59.8958 (59.8958)  time: 0.1295  data: 0.0597  max mem: 1982
Test:  [5/6]  eta: 0:00:00  loss: 2.2756 (2.2783)  acc1: 15.1042 (15.5000)  acc5: 59.3750 (58.5000)  time: 0.0584  data: 0.0277  max mem: 1982
Test: Total time: 0:00:00 (0.0585 s / it)
* Acc@1 15.500  loss 2.278
Accuracy of the network on the 1000 test images: 15.5%
Max accuracy: 15.50%
Image size is torch.Size([128, 3, 32, 32]).
Task: [0] Epoch: [1]  [ 0/39]  eta: 0:00:04  lr: 0.000001  loss: 0.5338 (0.5338)  time: 0.1238  data: 0.0449  max mem: 1982
Task: [0] Epoch: [1]  [10/39]  eta: 0:00:03  lr: 0.000001  loss: 0.5266 (0.5255)  time: 0.1227  data: 0.0442  max mem: 1982
Task: [0] Epoch: [1]  [20/39]  eta: 0:00:02  lr: 0.000001  loss: 0.5165 (0.5164)  time: 0.1216  data: 0.0432  max mem: 1982
Task: [0] Epoch: [1]  [30/39]  eta: 0:00:01  lr: 0.000001  loss: 0.4914 (0.5060)  time: 0.1222  data: 0.0434  max mem: 1982
Task: [0] Epoch: [1]  [38/39]  eta: 0:00:00  lr: 0.000001  loss: 0.4786 (0.4984)  time: 0.1225  data: 0.0436  max mem: 1982
Task: [0] Epoch: [1] Total time: 0:00:04 (0.1222 s / it)
Averaged stats: lr: 0.000001  loss: 0.4786 (0.4984)
Test:  [0/6]  eta: 0:00:00  loss: 2.2362 (2.2362)  acc1: 18.2292 (18.2292)  acc5: 65.6250 (65.6250)  time: 0.0690  data: 0.0447  max mem: 1982
Test:  [5/6]  eta: 0:00:00  loss: 2.2275 (2.2374)  acc1: 18.2292 (18.1000)  acc5: 62.5000 (62.7000)  time: 0.0479  data: 0.0259  max mem: 1982
Test: Total time: 0:00:00 (0.0480 s / it)
* Acc@1 18.100  loss 2.237
Accuracy of the network on the 1000 test images: 18.1%
Max accuracy: 18.10%
Traceback (most recent call last):
  File "/volumes2/Other/dytox/main.py", line 733, in <module>
    main(args)
  File "/volumes2/Other/dytox/main.py", line 590, in main
    memory.add(scenario_train[task_id], model, args.initial_increment if task_id == 0 else args.increment)
  File "/volumes2/Other/dytox/continual/rehearsal.py", line 68, in add
    x, y, t = herd_samples(dataset, model, self.memory_per_class, self.rehearsal)
  File "/volumes2/Other/dytox/continual/rehearsal.py", line 146, in herd_samples
    features, targets = extract_features(dataset, model, handling)
  File "/volumes2/Other/dytox/continual/rehearsal.py", line 181, in extract_features
    feats, _, _ = model.module.forward_features(x.cuda())
  File "/volumes1/Home/anaconda3/envs/timm-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 947, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'ConVit' object has no attribute 'module'

Process finished with exit code 1
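
As far as I can tell, extract_features accesses model.module, which only exists when the model is wrapped in DistributedDataParallel; since I'm not using distributed mode, the raw ConVit is passed in. A minimal sketch of a defensive unwrap (unwrap_model is a hypothetical helper, not from the repo):

import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    # Return the wrapped module for (Distributed)DataParallel models,
    # and the model itself otherwise. Hypothetical helper, not in the repo.
    return model.module if hasattr(model, "module") else model

# e.g. in extract_features (continual/rehearsal.py):
# feats, _, _ = unwrap_model(model).forward_features(x.cuda())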

I'm running 2 epochs per task (just to understand the structure of the code), and these are the arguments I'm using:

--options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path logs/ --output-basedir outputs/ --patch-size 4 --epochs 2 --base-epochs 2

I also noticed that the training and validation loops always use the classification head inside ConVit and never use the ContinualClassifier inside dytox.py. Is that expected?

After the first task, ConVit's classifier weights have changed (compared to the initialized weights), but the DyTox module's ContinualClassifier still has the same weights, and these unchanged weights are frozen before the second task. I was expecting ConVit's weights to be copied to the ContinualClassifier after every task. In short, I would like to know how you save the updated classifier weights of the previous task before moving to the next one.

Any clarification regarding this would be very helpful! Thank you!

Kishaan commented 2 years ago

On a different note, after training the first task (again with the same settings as above, 2 epochs per task), I'm getting this error in the bce_with_logits function:

/volumes1/Home/anaconda3/envs/timm-env/bin/python /volumes2/Other/dytox/main.py --options options/data/cifar100_10-10.yaml options/data/cifar100_order1.yaml options/model/cifar_dytox.yaml --name dytox --data-path logs/ --output-basedir outputs/ --patch-size 4 --epochs 2 --base-epochs 2
Not using distributed mode
Namespace(aa='rand-m9-mstd0.5-inc1', auto_kd=True, base_epochs=2, batch_size=128, bce_loss=True, class_attention=True, class_order=[87, 0, 52, 58, 44, 91, 68, 97, 51, 15, 94, 92, 10, 72, 49, 78, 61, 14, 8, 86, 84, 96, 18, 24, 32, 45, 88, 11, 4, 67, 69, 66, 77, 47, 79, 93, 29, 50, 57, 83, 17, 81, 41, 12, 37, 59, 25, 20, 80, 73, 1, 28, 6, 46, 62, 82, 53, 9, 31, 75, 38, 63, 33, 74, 27, 22, 36, 3, 16, 21, 60, 19, 70, 90, 89, 43, 5, 42, 65, 76, 40, 30, 23, 85, 2, 95, 56, 48, 71, 64, 98, 13, 99, 7, 34, 55, 54, 26, 35, 39], clip_grad=None, color_jitter=0.4, cooldown_epochs=10, cutmix=0.0, cutmix_minmax=None, data_path='logs/', data_set='CIFAR', debug=False, decay_epochs=30, decay_rate=0.1, depth=6, device='cuda', dist_eval=False, dist_url='env://', distillation_tau=1.0, distributed=False, drop=0.0, drop_path=0.1, dytox=True, embed_dim=384, epochs=2, eval=False, eval_every=50, finetuning='balanced', finetuning_epochs=20, finetuning_lr=5e-05, finetuning_resetclf=False, finetuning_teacher=False, fixed_memory=False, freeze_eval=False, freeze_ft=['sab'], freeze_task=['old_task_tokens', 'old_heads'], head_div=0.1, head_div_mode='tr', inat_category='name', increment=10, incremental_batch_size=128, incremental_lr=0.0005, incremental_warmup_lr=None, ind_clf='1-1', initial_increment=10, input_size=32, joint_tokens=False, local_rank=None, local_up_to_layer=5, locality_strength=1.0, log_category='10-10', log_dir='logs/cifar/10-10/22-03/week-4/25_dytox', log_path='logs', look_sam_alpha=0.7, look_sam_k=0, lr=0.0005, lr_noise=None, lr_noise_pct=0.67, lr_noise_std=1.0, max_task=None, memory_size=2000, min_lr=1e-05, mixup=0.0, mixup_mode='batch', mixup_prob=1.0, mixup_switch_prob=0.5, model='convit', momentum=0.9, name='dytox', no_amp=True, norm='layer', num_heads=12, num_workers=0, only_ft=False, opt='adamw', opt_betas=None, opt_eps=1e-08, options=['options/data/cifar100_10-10.yaml', 'options/data/cifar100_order1.yaml', 'options/model/cifar_dytox.yaml'], output_basedir='outputs/', output_dir='', patch_size=4, patience_epochs=10, pin_mem=True, recount=1, rehearsal='icarl_all', remode='pixel', repeated_aug=True, replay_memory=0, reprob=0.0, resplit=False, resume='', sam_adaptive=False, sam_div='', sam_final=None, sam_first='main', sam_mode=['tr', 'ft'], sam_rho=0.0, sam_second='main', sam_skip_first=False, save_every_epoch=None, sched='cosine', seed=0, sep_memory=False, smoothing=0.1, start_epoch=0, start_task=0, train_interpolation='bicubic', trial_id=1, validation=0.0, warmup_epochs=5, warmup_lr=1e-06, weight_decay=1e-06, world_size=1)
Files already downloaded and verified
Files already downloaded and verified
Creating model: convit
kdytox\
number of params: 10689334
Starting task id 0/9
Creating DyTox!
Adding new parameters
Start training for 2 epochs
Image size is torch.Size([128, 3, 32, 32]).
Task: [0] Epoch: [0]  [ 0/39]  eta: 0:00:20  lr: 0.000001  loss: 0.6905 (0.6905)  time: 0.5381  data: 0.0460  max mem: 1854
Task: [0] Epoch: [0]  [10/39]  eta: 0:00:04  lr: 0.000001  loss: 0.6784 (0.6763)  time: 0.1629  data: 0.0450  max mem: 1982
Task: [0] Epoch: [0]  [20/39]  eta: 0:00:02  lr: 0.000001  loss: 0.6524 (0.6549)  time: 0.1247  data: 0.0444  max mem: 1982
Task: [0] Epoch: [0]  [30/39]  eta: 0:00:01  lr: 0.000001  loss: 0.6127 (0.6364)  time: 0.1240  data: 0.0442  max mem: 1982
Task: [0] Epoch: [0]  [38/39]  eta: 0:00:00  lr: 0.000001  loss: 0.5865 (0.6195)  time: 0.1251  data: 0.0455  max mem: 1982
Task: [0] Epoch: [0] Total time: 0:00:05 (0.1355 s / it)
Averaged stats: lr: 0.000001  loss: 0.5865 (0.6195)
Test:  [0/6]  eta: 0:00:00  loss: 2.2770 (2.2770)  acc1: 14.0625 (14.0625)  acc5: 59.3750 (59.3750)  time: 0.1299  data: 0.0595  max mem: 1982
Test:  [5/6]  eta: 0:00:00  loss: 2.2770 (2.2788)  acc1: 15.6250 (15.9000)  acc5: 59.3750 (58.5000)  time: 0.0570  data: 0.0255  max mem: 1982
Test: Total time: 0:00:00 (0.0571 s / it)
* Acc@1 15.900  loss 2.279
Accuracy of the network on the 1000 test images: 15.9%
Max accuracy: 15.90%
Image size is torch.Size([128, 3, 32, 32]).
Task: [0] Epoch: [1]  [ 0/39]  eta: 0:00:04  lr: 0.000001  loss: 0.5388 (0.5388)  time: 0.1256  data: 0.0424  max mem: 1982
Task: [0] Epoch: [1]  [10/39]  eta: 0:00:03  lr: 0.000001  loss: 0.5310 (0.5286)  time: 0.1236  data: 0.0427  max mem: 1982
Task: [0] Epoch: [1]  [20/39]  eta: 0:00:02  lr: 0.000001  loss: 0.5139 (0.5189)  time: 0.1225  data: 0.0424  max mem: 1982
Task: [0] Epoch: [1]  [30/39]  eta: 0:00:01  lr: 0.000001  loss: 0.4955 (0.5078)  time: 0.1230  data: 0.0434  max mem: 1982
Task: [0] Epoch: [1]  [38/39]  eta: 0:00:00  lr: 0.000001  loss: 0.4796 (0.5000)  time: 0.1236  data: 0.0438  max mem: 1982
Task: [0] Epoch: [1] Total time: 0:00:04 (0.1233 s / it)
Averaged stats: lr: 0.000001  loss: 0.4796 (0.5000)
Test:  [0/6]  eta: 0:00:00  loss: 2.2385 (2.2385)  acc1: 18.2292 (18.2292)  acc5: 65.6250 (65.6250)  time: 0.0507  data: 0.0262  max mem: 1982
Test:  [5/6]  eta: 0:00:00  loss: 2.2265 (2.2373)  acc1: 18.2292 (18.1000)  acc5: 62.5000 (63.2000)  time: 0.0432  data: 0.0218  max mem: 1982
Test: Total time: 0:00:00 (0.0433 s / it)
* Acc@1 18.100  loss 2.237
Accuracy of the network on the 1000 test images: 18.1%
Max accuracy: 18.10%
Test:  [0/6]  eta: 0:00:00  loss: 2.2385 (2.2385)  acc1: 18.2292 (18.2292)  acc5: 65.6250 (65.6250)  time: 0.0794  data: 0.0548  max mem: 1982
Test:  [5/6]  eta: 0:00:00  loss: 2.2265 (2.2373)  acc1: 18.2292 (18.1000)  acc5: 62.5000 (63.2000)  time: 0.0465  data: 0.0249  max mem: 1982
Test: Total time: 0:00:00 (0.0466 s / it)
* Acc@1 18.100  loss 2.237
Accuracy of the network on the 1000 test images: 18.1%
Max accuracy: 18.10%
Starting task id 1/9
2000 samples added from memory.
Updating ensemble, new embed dim 384.
Adding new parameters
Start training for 2 epochs
Image size is torch.Size([128, 3, 32, 32]).
Traceback (most recent call last):
  File "/volumes2/Other/dytox/main.py", line 733, in <module>
    main(args)
  File "/volumes2/Other/dytox/main.py", line 540, in main
    train_stats = train_one_epoch(
  File "/volumes2/Other/dytox/continual/engine.py", line 60, in train_one_epoch
    loss_tuple = forward(samples, targets, model, teacher_model, criterion, lam, args)
  File "/volumes2/Other/dytox/continual/engine.py", line 162, in forward
    loss = criterion(main_output, targets) # bce_with_logits
  File "/volumes2/Other/dytox/continual/losses.py", line 71, in bce_with_logits
    torch.eye(x.shape[1])[y].to(y.device)
IndexError: index 12 is out of bounds for dimension 0 with size 10

Process finished with exit code 1
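
If it helps, this IndexError is easy to reproduce in isolation: torch.eye(n)[y] builds one-hot targets with n rows, so any label >= n (here, task 1 labels against a head that still has only 10 outputs) is out of bounds. A standalone sketch with made-up values:

import torch

logits = torch.randn(4, 10)             # head still has 10 outputs (task 0 classes)
targets = torch.tensor([3, 7, 12, 15])  # made-up labels; 12 and 15 come from task 1

# One-hot construction in the style of continual/losses.py:
onehot = torch.eye(logits.shape[1])[targets]
# -> IndexError: index 12 is out of bounds for dimension 0 with size 10
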
arthurdouillard commented 2 years ago

Hey,

Thanks for your interest in my work :)

Both of your errors came from the fact that you didn't launch with the train.sh script, as advised in the README, and were therefore not using distributed mode (which also works very well on a single GPU).

I've made a few modifications to the code base so it can also be used with main.py.
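
The gist of it is the usual pattern of keeping a handle to the unwrapped model, so nothing downstream has to go through .module (a sketch of the idea under that assumption, not the exact commit):

import torch
import torch.nn as nn

def maybe_wrap(model: nn.Module, distributed: bool):
    # Keep a handle to the raw model so non-distributed runs never
    # need .module; assumes init_process_group was already called
    # when distributed=True. Illustrative only.
    model_without_ddp = model
    if distributed:
        model = nn.parallel.DistributedDataParallel(
            model, device_ids=[torch.cuda.current_device()])
        model_without_ddp = model.module
    return model, model_without_ddp

# Downstream code calls model_without_ddp.forward_features(...)
# instead of model.module.forward_features(...), so both modes work.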

If you run into errors, please check the README first; most of them are answered there. If not, don't hesitate to open an issue :)