MrtnMndt / OpenVAE_ContinualLearning

Open-source code for our paper: Unified Probabilistic Deep Continual Learning through Generative Replay and Open Set Recognition
https://doi.org/10.3390/jimaging8040093
MIT License

Issue in Models\Architecture.py #1

Closed RamyaRaghuraman closed 4 years ago

RamyaRaghuraman commented 4 years ago

Hello everyone, I got the following error while running the command: python3 main.py --incremental-data True --openset-generative-replay True --dataset MNIST

Fitting Weibull models with tailsize: 300
0%| | 0/12000 [00:00<?, ?it/s]
Using generative model to replay old data with openset detection
100%|██████████| 12000/12000 [04:44<00:00, 21.32it/s]
Openset sampling successful. Generating dataset
100%|██████████| 94/94 [00:12<00:00, 7.76it/s]
Traceback (most recent call last):
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 328, in <module>
    main()
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 289, in main
    grow_classifier(model.module.classifier, args.num_increment_tasks, WeightInitializer)
  File "C:\Users\RAR7ABT\pj-val-ml\pjval_ml\OSR\OCDVAE\lib\Models\architectures.py", line 23, in grow_classifier
    classifier[-1].weight.data.size(1))
RuntimeError: set_sizes_contiguous is not allowed on Tensor created from .data or .detach()

Process finished with exit code 1

I tried changing "classifier[-1].weight.data.resize_(classifier[-1].weight.data.size(0) + class_increment, classifier[-1].weight.data.size(1))" to "classifier[-1].weight.resize_(classifier[-1].weight.data.size(0) + class_increment, classifier[-1].weight.data.size(1))", but got the following error:

Traceback (most recent call last):
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 328, in <module>
    main()
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 289, in main
    grow_classifier(model.module.classifier, args.num_increment_tasks, WeightInitializer)
  File "C:\Users\RAR7ABT\pj-val-ml\pjval_ml\OSR\OCDVAE\lib\Models\architectures.py", line 23, in grow_classifier
    classifier[-1].weight.resize_(classifier[-1].weight.data.size(0) + class_increment,
RuntimeError: cannot resize variables that require grad

I then changed resize_() to reshape(), but still ended up with:

Traceback (most recent call last):
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 328, in <module>
    main()
  File "C:/Users/RAR7ABT/pj-val-ml/pjval_ml/OSR/OCDVAE/main.py", line 289, in main
    grow_classifier(model.module.classifier, args.num_increment_tasks, WeightInitializer)
  File "C:\Users\RAR7ABT\pj-val-ml\pjval_ml\OSR\OCDVAE\lib\Models\architectures.py", line 23, in grow_classifier
    classifier[-1].weight.data.size(1))
RuntimeError: shape '[4, 60]' is invalid for input of size 120

I am not exactly sure where this size mismatch occurs. I would really appreciate it if someone took the time to look into it.

MrtnMndt commented 4 years ago

Hi, thanks for opening this issue. From what I can see my guess would be that you are using a newer PyTorch version than what we had used when creating this repository and something in the PyTorch internals must have changed. I have tried running the code on a fresh PyTorch 1.0 installation on Mac and Linux and it works. Could you please quickly do the same to check if that is the case for your (Windows?) machine (i.e. if it works with PyTorch 1.0, or whether the problem is in Windows PyTorch)?

Specifically, when you execute the code it seems to crash when a new task is encountered and the classifier gets extra units for the new classes (because the open set algorithm and sampling completes and the generative replay does as well). We had implemented this by in-place resizing the underlying weight tensor to have extra units and then needed to also resize the gradient tensors because otherwise the optimizer wouldn't automatically know about it. From the error message it looks like for some reason in-place resizing of a variable that is getting optimized is no longer valid in newer PyTorch versions.
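For readers on newer PyTorch versions, one common way to add output units without in-place resizing is to build a larger layer and copy the old parameters into it. The following is only a minimal sketch of that idea, not the repository's code and not necessarily the fix that was eventually merged. The helper name `grow_linear` is hypothetical, and the sizes (60 input features, 2 classes growing to 4) are taken from the error message reported above.

```python
import torch
import torch.nn as nn


def grow_linear(old: nn.Linear, class_increment: int) -> nn.Linear:
    """Hypothetical helper: return a Linear layer with extra output units.

    Rows belonging to the existing classes are copied from `old`;
    the new rows keep PyTorch's default initialization.
    """
    new = nn.Linear(old.in_features,
                    old.out_features + class_increment,
                    bias=old.bias is not None)
    new = new.to(device=old.weight.device, dtype=old.weight.dtype)
    with torch.no_grad():
        # Copy the old class weights into the first rows of the new layer.
        new.weight[:old.out_features].copy_(old.weight)
        if old.bias is not None:
            new.bias[:old.out_features].copy_(old.bias)
    return new


# Assumed toy classifier: 60 features, 2 classes, grown by 2 classes.
classifier = nn.Sequential(nn.Linear(60, 2))
old_weight = classifier[-1].weight.detach().clone()
classifier[-1] = grow_linear(classifier[-1], class_increment=2)
grown_shape = tuple(classifier[-1].weight.shape)
```

Because this creates new parameter objects, an optimizer built before the growth step still points at the old ones and would need to be re-created or given the new parameters.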

I believe your reshape command doesn't really fix this because it is not just about switching dimensions around, but actually modifying the weights to be of larger dimensions, i.e. an actual resize with allocation of more memory.
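The element counts in the reported error illustrate this: a weight of 2 x 60 holds 120 values, while the requested shape of 4 x 60 needs 240. A small sketch, using an assumed zero-filled tensor in place of the real classifier weight, shows that reshape refuses such a request, whereas concatenating newly allocated rows produces the larger tensor.

```python
import torch

# Stand-in for the classifier weight: 2 classes x 60 features = 120 elements.
weight = torch.zeros(2, 60)

try:
    weight.reshape(4, 60)  # would need 240 elements, so this raises
    reshape_error = None
except RuntimeError as err:
    reshape_error = str(err)

# Growing requires new memory, e.g. by concatenating freshly allocated rows.
extra_rows = torch.zeros(2, 60)
grown = torch.cat([weight, extra_rows], dim=0)
grown_shape = tuple(grown.shape)
```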

Before proceeding to find a potential solution for newer PyTorch versions, could you please confirm that the code is also working for you with PyTorch 1.0, so that we don't debug on the wrong end? If that is the case I would have to take a deeper look again at the PyTorch patch notes, because I couldn't find any proposed alternative or why they should have removed this feature. I would probably have to open a PyTorch forum post if we can confirm the issue is with the newest version only.

Thanks in advance.

RamyaRaghuraman commented 4 years ago

@MrtnMndt I apologize for my really late reply. I had some issues downgrading PyTorch to 1.0 because my conda environment was corrupt. But like you said, it seems to work now. I have attached the result for your reference.

python main.py --epochs 1 --incremental-data True --openset-generative-replay True --dataset MNIST

`Initializing network with: kaiming-normal
C:\Users\RAR7ABT\AppData\Local\conda\conda\envs\test\lib\site-packages\torch\nn\modules\upsampling.py:129: UserWarning: nn.Upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.{} is deprecated. Use nn.functional.interpolate instead.".format(self.name))
Training: [1][0/94] Time 34.186 (34.186) Data 19.431 (19.431) Loss 2.9964 (2.9964) Class Loss 1.0540 (1.0540) Prec@1 53.125 (53.125) Recon Loss 0.8578 (0.8578) KL 1.0846 (1.0846)

Process finished with exit code 0 `

MrtnMndt commented 4 years ago

Thanks @RamyaRaghuraman for checking that it works with version 1.0 of PyTorch. We have been able to replicate the issue with PyTorch 1.1. I have gone through the patch notes again and there is no note concerning the removal or deprecation of this functionality.

I have opened a PyTorch forum post in the hope of getting an explanation and a pointer on how to change the code so that it works again in PyTorch 1.1: https://discuss.pytorch.org/t/resizing-layers-weights-data-in-pytorch-1-1/52867

In the meantime I am changing the README to reflect that this repository requires PyTorch 1.0 and no newer version.

For now, I am leaving this issue open for other people to see and potentially comment/pull request if someone is able to find a solution.

Thanks again for catching this.

PS: your example above ran through, and you have probably just pasted it to show that the code executes fully. If you are wondering why the Weibull part prints a "fail/time-out", it is because it has no proper model to work on after 1 epoch (i.e. you won't get any meaningful samples by drawing from the prior because the KL term hasn't been optimized yet), and we have defaulted to continuing the training without it instead of letting it crash.

MrtnMndt commented 4 years ago

The issue has been solved with a pull request from a community member: https://github.com/MrtnMndt/OCDVAE_ContinualLearning/pull/4

The code now works with more recent PyTorch versions.