Issue in Models\Architecture.py

Hello everyone, I got the following error while running the command: python3 main.py --incremental-data True --openset-generative-replay True --dataset MNIST

Fitting Weibull models with tailsize: 300
0%| | 0/12000 [00:00<?, ?it/s]Using generative model to replay old data with openset detection
100%|██████████| 12000/12000 [04:44<00:00, 21.32it/s]
Openset sampling successful. Generating dataset
100%|██████████| 94/94 [00:12<00:00, 7.76it/s]
Traceback (most recent call last):
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 328, in <module>
    main()
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 289, in main
    grow_classifier(model.module.classifier, args.num_increment_tasks, WeightInitializer)
  File "C:\Users\RAR7ABT\pj-val-ml\pjval_ml\OSR\OCDVAE\lib\Models\architectures.py", line 23, in grow_classifier
    classifier[-1].weight.data.size(1))
RuntimeError: set_sizes_contiguous is not allowed on Tensor created from .data or .detach()

Process finished with exit code 1

I tried changing `classifier[-1].weight.data.resize_(classifier[-1].weight.data.size(0) + class_increment, classifier[-1].weight.data.size(1))` to `classifier[-1].weight.resize_(classifier[-1].weight.data.size(0) + class_increment, classifier[-1].weight.data.size(1))`, but got the following error:

Traceback (most recent call last):
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 328, in <module>
    main()
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 289, in main
    grow_classifier(model.module.classifier, args.num_increment_tasks, WeightInitializer)
  File "C:\Users\RAR7ABT\pj-val-ml\pjval_ml\OSR\OCDVAE\lib\Models\architectures.py", line 23, in grow_classifier
    classifier[-1].weight.resize_(classifier[-1].weight.data.size(0) + class_increment,
RuntimeError: cannot resize variables that require grad

I then changed `resize_()` to `reshape()`, but still ended up with:

Traceback (most recent call last):
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 328, in <module>
    main()
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 289, in main
    grow_classifier(model.module.classifier, args.num_increment_tasks, WeightInitializer)
  File "C:\Users\RAR7ABT\pj-val-ml\pjval_ml\OSR\OCDVAE\lib\Models\architectures.py", line 23, in grow_classifier
    classifier[-1].weight.data.size(1))
RuntimeError: shape '[4, 60]' is invalid for input of size 120

I am not sure where this size mismatch occurs. I would really appreciate it if someone took the time to look into it.

Hi, thanks for opening this issue. My guess is that you are using a newer PyTorch version than the one we used when creating this repository, and that something in the PyTorch internals has changed. I have tried running the code on a fresh PyTorch 1.0 installation on Mac and Linux, and it works. Could you please do the same to check whether that is the case for your (Windows?) machine, i.e. whether it works with PyTorch 1.0 or whether the problem lies in PyTorch on Windows?

Specifically, the code seems to crash when a new task is encountered and the classifier gets extra units for the new classes; the open set algorithm, the sampling, and the generative replay all complete before that point. We implemented this by resizing the underlying weight tensor in place to add the extra units, and then also resized the gradient tensors, because otherwise the optimizer wouldn't automatically know about the change. From the error message, it looks like in-place resizing of a variable that is being optimized is no longer valid in newer PyTorch versions.

I believe your reshape command doesn't fix this, because the goal is not to rearrange dimensions but to make the weights larger, i.e. an actual resize that allocates more memory.
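The size mismatch in the last error message can be checked with plain arithmetic. The sketch below is illustrative only: `can_reshape` is a hypothetical helper, and the shapes are inferred from the error message (120 existing elements, requested shape `[4, 60]`), which implies a current weight of 2 x 60 and a class increment of 2.

```python
from math import prod


def can_reshape(old_shape, new_shape):
    """A reshape only succeeds when both shapes hold the same number of elements."""
    return prod(old_shape) == prod(new_shape)


# Shapes inferred from the error message, not read from the repository code.
old_shape = (2, 60)   # current weight: 2 classes x 60 features = 120 elements
class_increment = 2   # assumed: two new classes per task
new_shape = (old_shape[0] + class_increment, old_shape[1])

print(prod(old_shape), prod(new_shape), can_reshape(old_shape, new_shape))
# prints: 120 240 False
```

Because the target shape needs 240 values and the tensor only holds 120, a reshape cannot succeed here; the extra rows have to be allocated.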

Before looking for a solution for newer PyTorch versions, could you please confirm that the code also works for you with PyTorch 1.0, so that we don't debug on the wrong end? If it does, I will have to take a deeper look at the PyTorch patch notes again, because I couldn't find any proposed alternative or any reason why this feature would have been removed. I would probably open a PyTorch forum post once we can confirm that the issue occurs only with the newest version.

Thanks in advance.

@MrtnMndt I apologize for my really late reply. I had some issues downgrading PyTorch to 1.0 because my conda environment was corrupt. But as you said, it works now. I have attached the output for your reference.

python main.py --epochs 1 --incremental-data True --openset-generative-replay True --dataset MNIST

`Initializing network with: kaiming-normal
C:\Users\RAR7ABT\AppData\Local\conda\conda\envs\test\lib\site-packages\torch\nn\modules\upsampling.py:129: UserWarning: nn.Upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.{} is deprecated. Use nn.functional.interpolate instead.".format(self.name))
Training: [1][0/94] Time 34.186 (34.186) Data 19.431 (19.431) Loss 2.9964 (2.9964) Class Loss 1.0540 (1.0540) Prec@1 53.125 (53.125) Recon Loss 0.8578 (0.8578) KL 1.0846 (1.0846)
Train: Loss 10.18713 Prec@1 93.875
Validate: [1][0/16] Time 17.446 (17.446) Loss 6.3595 (6.3595) Class Loss 0.3632 (0.3632) Prec@1 95.312 (95.312) Recon Loss 298.0148 (298.0148) KL 347.8653 (347.8653)
Validation: Loss 7.08156 Prec@1 96.350
Saving the last checkpoint from the previous task ...
Incrementing dataset ...
Fitting Weibull models with tailsize: 300
Using generative model to replay old data with openset detection
100%|█████████▉| 11999/12000 [19:08<00:00, 5.49it/s]Openset sampling successful. Generating dataset
100%|██████████| 12000/12000 [19:08<00:00, 5.57it/s]
100%|██████████| 94/94 [00:12<00:00, 7.78it/s]
Training: [2][0/188] Time 16.821 (16.821) Data 16.315 (16.315) Loss 26.9922 (26.9922) Class Loss 7.7087 (7.7087) Prec@1 31.250 (31.250) Recon Loss 0.5442 (0.5442) KL 18.7393 (18.7393)
Training: [2][100/188] Time 1.318 (1.497) Data 0.868 (1.002) Loss 2.7647 (7.8378) Class Loss 0.2609 (1.2386) Prec@1 90.625 (87.167) Recon Loss 0.3828 (0.3980) KL 2.1209 (6.2013)
Train: Loss 5.26994 Prec@1 90.754
Validate: [2][0/32] Time 8.985 (8.985) Loss 6.8706 (6.8706) Class Loss 18.1945 (18.1945) Prec@1 47.656 (47.656) Recon Loss 309.5249 (309.5249) KL 115.6308 (115.6308)
Validation: Loss 6.91538 Prec@1 53.925
Incremental validation: Base Prec@1 16.650 New Prec@1 91.200 Base Recon Loss 0.385 New Recon Loss 0.397
Saving the last checkpoint from the previous task ...
Incrementing dataset ...
0%| | 0/24000 [00:00<?, ?it/s]Fitting Weibull models with tailsize: 300
Using generative model to replay old data with openset detection
3%|▎ | 725/24000 [01:04<39:55, 9.72it/s]
Open set generative replay from standard Gaussian failed. Trying sampling with modified variance bound
6%|▌ | 1433/24000 [02:09<37:07, 10.13it/s]
0%| | 0/187 [00:00<?, ?it/s]
Open set generative replay timeout
Using generative model to replay old data
100%|██████████| 187/187 [00:24<00:00, 7.57it/s]
Training: [3][0/281] Time 8.509 (8.509) Data 8.010 (8.010) Loss 5.1222 (5.1222) Class Loss 3.4565 (3.4565) Prec@1 26.562 (26.562) Recon Loss 0.4627 (0.4627) KL 1.2030 (1.2030)
Training: [3][100/281] Time 1.312 (1.391) Data 0.859 (0.942) Loss 2.1538 (3.2647) Class Loss 0.8707 (1.0759) Prec@1 59.375 (62.724) Recon Loss 0.4064 (0.4137) KL 0.8767 (1.7751)
Training: [3][200/281] Time 1.328 (1.357) Data 0.890 (0.907) Loss 1.9851 (2.6659) Class Loss 0.8173 (0.9082) Prec@1 61.719 (66.123) Recon Loss 0.4076 (0.4093) KL 0.7602 (1.3485)
Train: Loss 2.46187 Prec@1 67.342
Validate: [3][0/47] Time 9.545 (9.545) Loss 5.0836 (5.0836) Class Loss 22.9982 (22.9982) Prec@1 47.656 (47.656) Recon Loss 296.5157 (296.5157) KL 52.3383 (52.3383)
Validation: Loss 5.66586 Prec@1 41.017
Incremental validation: Base Prec@1 16.300 New Prec@1 97.900 Base Recon Loss 0.405 New Recon Loss 0.344
Saving the last checkpoint from the previous task ...
Incrementing dataset ...
Fitting Weibull models with tailsize: 299
0%| | 0/35936 [00:00<?, ?it/s]Using generative model to replay old data with openset detection
0%| | 4/35936 [01:03<133:54:29, 13.42s/it]
Open set generative replay from standard Gaussian failed. Trying sampling with modified variance bound
0%| | 10/35936 [02:05<76:29:02, 7.66s/it]
0%| | 0/280 [00:00<?, ?it/s]
Open set generative replay timeout
Using generative model to replay old data
100%|██████████| 280/280 [00:36<00:00, 7.69it/s]
Training: [4][0/374] Time 8.602 (8.602) Data 8.101 (8.101) Loss 5.1212 (5.1212) Class Loss 3.7591 (3.7591) Prec@1 11.719 (11.719) Recon Loss 0.4173 (0.4173) KL 0.9447 (0.9447)
Training: [4][100/374] Time 1.337 (1.398) Data 0.884 (0.947) Loss 2.4616 (2.7395) Class Loss 1.1036 (1.2488) Prec@1 58.594 (53.481) Recon Loss 0.4021 (0.4100) KL 0.9558 (1.0807)
Training: [4][200/374] Time 1.355 (1.363) Data 0.880 (0.912) Loss 2.3106 (2.5043) Class Loss 0.9900 (1.1048) Prec@1 57.031 (58.015) Recon Loss 0.3940 (0.4046) KL 0.9266 (0.9950)
Training: [4][300/374] Time 1.345 (1.355) Data 0.890 (0.902) Loss 2.3289 (2.4014) Class Loss 1.0450 (1.0544) Prec@1 57.031 (59.577) Recon Loss 0.4182 (0.4029) KL 0.8657 (0.9441)
Train: Loss 2.35291 Prec@1 60.299
Validate: [4][0/63] Time 8.394 (8.394) Loss 6.9810 (6.9810) Class Loss 46.0030 (46.0030) Prec@1 25.000 (25.000) Recon Loss 291.3834 (291.3834) KL 51.5351 (51.5351)
Validation: Loss 7.18785 Prec@1 26.637
Incremental validation: Base Prec@1 5.250 New Prec@1 94.350 Base Recon Loss 0.365 New Recon Loss 0.324
Saving the last checkpoint from the previous task ...
Incrementing dataset ...
Fitting Weibull models with tailsize: 299
Using generative model to replay old data with openset detection
0%| | 2/47840 [01:12<486:16:32, 36.59s/it]
Open set generative replay from standard Gaussian failed. Trying sampling with modified variance bound
0%| | 0/373 [00:00<?, ?it/s]
Open set generative replay timeout
Using generative model to replay old data
100%|██████████| 373/373 [00:48<00:00, 7.64it/s]
Training: [5][0/467] Time 9.052 (9.052) Data 8.598 (8.598) Loss 5.9071 (5.9071) Class Loss 4.6867 (4.6867) Prec@1 29.688 (29.688) Recon Loss 0.4128 (0.4128) KL 0.8077 (0.8077)
Training: [5][100/467] Time 1.328 (1.405) Data 0.875 (0.954) Loss 2.6885 (2.9246) Class Loss 1.3585 (1.4574) Prec@1 48.438 (49.776) Recon Loss 0.3943 (0.3976) KL 0.9358 (1.0696)
Training: [5][200/467] Time 1.329 (1.364) Data 0.875 (0.915) Loss 2.6306 (2.7434) Class Loss 1.2998 (1.3225) Prec@1 48.438 (52.627) Recon Loss 0.3859 (0.3926) KL 0.9449 (1.0283)
Training: [5][300/467] Time 1.313 (1.351) Data 0.860 (0.902) Loss 2.5339 (2.6507) Class Loss 1.2450 (1.2644) Prec@1 53.906 (53.852) Recon Loss 0.3704 (0.3907) KL 0.9184 (0.9956)
Training: [5][400/467] Time 1.345 (1.344) Data 0.875 (0.895) Loss 2.3704 (2.6004) Class Loss 1.1044 (1.2351) Prec@1 55.469 (54.580) Recon Loss 0.3783 (0.3895) KL 0.8877 (0.9757)
Train: Loss 2.57632 Prec@1 54.804
Validate: [5][0/79] Time 9.198 (9.198) Loss 5.7018 (5.7018) Class Loss 44.2696 (44.2696) Prec@1 35.156 (35.156) Recon Loss 303.8944 (303.8944) KL 53.2341 (53.2341)
Validation: Loss 5.51114 Prec@1 34.010
Incremental validation: Base Prec@1 15.300 New Prec@1 98.000 Base Recon Loss 0.418 New Recon Loss 0.371

Process finished with exit code 0 `

Thanks @RamyaRaghuraman for checking that it works with version 1.0 of PyTorch. We have been able to replicate the issue with PyTorch 1.1. I have gone through the patch notes again, and there is no note concerning the removal or deprecation of this functionality.

I have opened a PyTorch forum post in the hope of getting an explanation and a pointer on how to change the code so that it works again in PyTorch 1.1: https://discuss.pytorch.org/t/resizing-layers-weights-data-in-pytorch-1-1/52867 In the meantime, I am changing the README to state that this repository requires PyTorch 1.0 and no newer version.

For now, I am leaving this issue open for other people to see and potentially comment/pull request if someone is able to find a solution.

Thanks again for catching this. PS: your example above ran through, and you probably just pasted it to show that the code executes fully. If, however, you are wondering why the Weibull part prints a "fail/time-out": it has no proper model to work on after 1 epoch (i.e. you won't get any meaningful samples by drawing from the prior because the KL term hasn't been optimized yet), and we have defaulted to continuing the training without it instead of letting it crash.

The issue has been solved with a pull request from a community member: https://github.com/MrtnMndt/OCDVAE_ContinualLearning/pull/4

The code now works with more recent PyTorch versions.
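For readers who hit the same error elsewhere: one common way to add output units on recent PyTorch versions, without in-place resizing, is to build a larger layer and copy the old parameters into it. The following is a minimal sketch under that assumption and is not taken from the pull request; `grow_linear` and the toy 60-feature, 2-class classifier are hypothetical.

```python
import torch
import torch.nn as nn


def grow_linear(layer: nn.Linear, class_increment: int) -> nn.Linear:
    """Return a new nn.Linear with `class_increment` extra output units.

    The first rows of the new weight (and bias) are copied from `layer`;
    the added rows keep nn.Linear's default initialization.
    """
    old_out, in_features = layer.weight.shape
    new_layer = nn.Linear(in_features, old_out + class_increment,
                          bias=layer.bias is not None)
    new_layer = new_layer.to(device=layer.weight.device, dtype=layer.weight.dtype)
    with torch.no_grad():  # copy values without recording autograd history
        new_layer.weight[:old_out].copy_(layer.weight)
        if layer.bias is not None:
            new_layer.bias[:old_out].copy_(layer.bias)
    return new_layer


# Toy stand-in for the classifier head (hypothetical sizes).
classifier = nn.Sequential(nn.Linear(60, 2))
old_weight = classifier[-1].weight.detach().clone()

classifier[-1] = grow_linear(classifier[-1], class_increment=2)
print(tuple(classifier[-1].weight.shape))  # prints: (4, 60)
```

Because the swapped-in layer owns new parameter tensors, any optimizer created earlier still refers to the old ones and would need to be rebuilt (or given the new parameters) after growing.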
