HansBambel / multidim_conv

PyTorch code for the paper "Wind speed prediction using multidimensional convolutional neural networks"

Such functions do not allow the output views to be modified inplace #2

Open poemon opened 1 year ago

poemon commented 1 year ago
  File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 404, in <module>
    train_wind_nl(folder+data, epochs=150, input_timesteps=6, prediction_timestep=t, dev=dev, earlystopping=20)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 307, in train_wind_nl
    summary(model, (7, input_timesteps, 6), device="cpu")
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torchsummary\torchsummary.py", line 72, in summary
    model(*x)
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\models\wind_models.py", line 93, in forward
    x = F.relu(self.conv1(x))
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 52, in forward
    flat_q, flat_k, flat_v, q, k, v = self.compute_flat_qkv(x, self.dk, self.dv, self.Nh)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 79, in compute_flat_qkv
    q *= dkh ** -0.5
RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one. 

Hello author, I created a new virtual environment in Anaconda and installed the latest version of PyTorch. However, when running the code, it stops and throws an error around 30%. The error code is shown above. I searched on Google and made some improvements to the line of code q *= dkh ** -0.5. Now it doesn't stop at 30% anymore, but it starts from 0 again. I spent a long time trying to figure it out but couldn't. I hope to get your answer on what I should do to run your code.

HansBambel commented 1 year ago

Hey!

Thanks for opening the issue. I will try to help you here.

> installed the latest version of PyTorch

It could be that this is already the main issue. The requirement listed is pytorch 1.4.0, but it could also be that the earlier version was just masking a bug. So let's check further.

> I searched on Google and made some improvements to the line of code q *= dkh ** -0.5

What changes did you make? The error message sounds like you can circumvent this by assigning the result to a new variable instead of q. Have you tried this?
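A minimal sketch of the in-place versus out-of-place difference mentioned here, assuming a tensor that requires grad and is split with torch.split. The helper names, shapes and values are made up for illustration and are not taken from the repository. Recent PyTorch versions are expected to raise a RuntimeError on the in-place variant, while older versions may not, so the sketch catches the error instead of relying on it.

```python
import torch


def split_and_scale_in_place(qkv, scale):
    # torch.split returns several views of the same tensor.
    q, k, v = torch.split(qkv, [2, 2, 2], dim=1)
    # In-place update of one of those views (the pattern from the traceback).
    q *= scale
    return q


def split_and_scale_out_of_place(qkv, scale):
    q, k, v = torch.split(qkv, [2, 2, 2], dim=1)
    # Out-of-place: the product is a new tensor, the view itself is untouched.
    q_scaled = q * scale
    return q_scaled


x = torch.ones(1, 6, requires_grad=True)

try:
    split_and_scale_in_place(x * 2.0, 0.5)
    in_place_error = None
except RuntimeError as err:
    # Store the message so it can be inspected.
    in_place_error = str(err)

q_scaled = split_and_scale_out_of_place(x * 2.0, 0.5)
q_scaled.sum().backward()  # gradients flow through the out-of-place version
print(in_place_error)
print(q_scaled, x.grad)
```

Assigning the product to a new name (or rebinding q with q = q * scale) avoids modifying the view, which is what the error message asks for.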

> Now it doesn't stop at 30% anymore, but it starts from 0 again

What do you mean by this? The training? I don't see how an error would reset the progress made up to that point.

poemon commented 1 year ago

Oh my goodness, you replied to me so quickly! Thank you very much

My current version of PyTorch is 2.1.0. I changed the q *= dkh ** -0.5 in attention_augmented_conv.py to

tmp_tensor = dkh ** -0.5
q = q * tmp_tensor

OR

q = q * dkh ** -0.5

Neither works.

This change was made because of the solution given to me when I searched google for the problem

RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

When I changed it like this, the error message changed to

NL dataset. Step:  1
Device: cuda
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1             [-1, 32, 5, 5]             928
            Linear-2                  [-1, 128]         102,528
            Linear-3                   [-1, 64]           8,256
            Linear-4                    [-1, 7]             455
       DoubleDense-5                    [-1, 7]               0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs:   30%|          | 50/150 [00:00<?, ?it/s]
Device: cuda
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1             [-1, 32, 5, 5]             928
            Linear-2                  [-1, 128]         102,528
            Linear-3                   [-1, 64]           8,256
            Linear-4                    [-1, 7]             455
       DoubleDense-5                    [-1, 7]               0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs:   0%|          | 0/150 [00:00<?, ?it/s]

You know what I mean? The code doesn't stop when it reaches 30%, it starts all over again.

I've also tried a combination of python3.8 + pytorch1.4

But this creates a new problem

Traceback (most recent call last):
  File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 1, in <module>
    import torch
  File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\torch\__init__.py", line 44, in <module>
    import numpy as _np
  File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\numpy\__init__.py", line 125, in <module>
    from numpy.__config__ import show as show_config
  File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\numpy\__config__.py", line 12, in <module>
    os.add_dll_directory(extra_dll_dir)
AttributeError: module 'os' has no attribute 'add_dll_directory' 

So do you know what the problem is? Should I be looking for problems with the new version or go back to the old one? Thanks again.

HansBambel commented 1 year ago

That is indeed weird. I do not see any reason why it should restart. Did you maybe also remove the print statements in train.py? The script trains 8 models in sequence: 4 timesteps, for both the NL and DK datasets (https://github.com/HansBambel/multidim_conv/blob/1caae1fd1fc710084df1438aa1a76fd35c56cd3b/train.py#L401-L407).
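As a rough illustration of why a fresh model summary and a progress bar at 0% can show up without any crash, here is a sketch of a script that runs several trainings back to back. The loop order, the function names and the log format are assumptions made for this example. Only the 4 timesteps, the NL and DK datasets and the 150 epochs come from the thread.

```python
def planned_runs(datasets=("NL", "DK"), timesteps=(1, 2, 3, 4)):
    """List the (dataset, prediction_timestep) pairs trained one after another."""
    return [(dataset, t) for dataset in datasets for t in timesteps]


def fake_train(dataset, prediction_timestep, epochs=150):
    # Hypothetical stand-in for the real training functions: every call
    # starts its own epoch counter (and progress bar) from zero.
    return f"{dataset} dataset. Step: {prediction_timestep} | progress 0/{epochs}"


runs = planned_runs()
log = [fake_train(dataset, t) for dataset, t in runs]
for line in log:
    print(line)
```

Each entry in the log starts again at 0 because it belongs to a new training run, not because the previous one was reset.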

> I've also tried a combination of python3.8 + pytorch1.4

I think I used Python 3.7.

The error AttributeError: module 'os' has no attribute 'add_dll_directory' led me here, where it is said that add_dll_directory was only introduced in Python 3.8.

Maybe you can create a new env with 3.7 and try that again.
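Before rebuilding the environment, a quick check like the following sketch can show which interpreter is actually running and whether it provides os.add_dll_directory. The function name is made up for this example. Note that os.add_dll_directory only exists on Windows, so the check reports False on other platforms regardless of the Python version.

```python
import os
import sys


def dll_directory_support():
    """Report the running Python version and whether os.add_dll_directory exists."""
    return {
        "python": sys.version_info[:2],
        "has_add_dll_directory": hasattr(os, "add_dll_directory"),
    }


info = dll_directory_support()
print(info)
```

Running this inside the activated conda env shows whether the installed NumPy build and the interpreter are likely to match.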

poemon commented 1 year ago

I have lowered the version as you instructed and created a new virtual environment with Python 3.7. The versions of the other packages are as follows:

Package                 Version
----------------------- ----------
absl-py                 2.0.0
cachetools              5.3.1
certifi                 2022.12.7
charset-normalizer      3.3.0
colorama                0.4.6
einops                  0.2.0
google-auth             2.23.3
google-auth-oauthlib    0.4.6
grpcio                  1.59.0
idna                    3.4
importlib-metadata      6.7.0
Markdown                3.4.4
MarkupSafe              2.1.3
numpy                   1.21.6
oauthlib                3.2.2
olefile                 0.46
Pillow                  9.5.0
pip                     22.3.1
protobuf                3.20.3
pyasn1                  0.5.0
pyasn1-modules          0.3.0
requests                2.31.0
requests-oauthlib       1.3.1
rsa                     4.9
scipy                   1.7.3
setuptools              65.6.3
six                     1.16.0
tensorboard             2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
torch                   1.4.0+cu92
torchsummary            1.5.1
torchvision             0.5.0+cu92
tqdm                    4.66.1
typing_extensions       4.7.1
urllib3                 2.0.6
Werkzeug                2.2.3
wheel                   0.38.4
wincertstore            0.2
zipp                    3.15.0

It reported an error at 52%.

NL dataset. Step:  1
Device: cuda
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1             [-1, 32, 5, 5]             928
            Linear-2                  [-1, 128]         102,528
            Linear-3                   [-1, 64]           8,256
            Linear-4                    [-1, 7]             455
       DoubleDense-5                    [-1, 7]               0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs:  52%|█████▏    | 78/150 [15:18<14:07, 11.78s/it]
Stopping early --> val_loss has not decreased over 20 epochs
Traceback (most recent call last):
  File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 404, in <module>
    train_wind_nl(folder+data, epochs=150, input_timesteps=6, prediction_timestep=t, dev=dev, earlystopping=20)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 307, in train_wind_nl
    summary(model, (7, input_timesteps, 6), device="cpu")
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torchsummary\torchsummary.py", line 72, in summary
    model(*x)
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\models\wind_models.py", line 93, in forward
    x = F.relu(self.conv1(x))
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 52, in forward
    flat_q, flat_k, flat_v, q, k, v = self.compute_flat_qkv(x, self.dk, self.dv, self.Nh)
  File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 79, in compute_flat_qkv
    q *= dkh ** -0.5
RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

What can I do to get your code to work correctly? Thank you very much!

HansBambel commented 1 year ago

I don't know why it is not working for you, but it is not crashing during training. The line "Stopping early --> val_loss has not decreased over 20 epochs" shows that the crash happened after that.

Does the model get saved in models/trained_models/wind_model_NL_{prediction_timestep}h_{model.__class__.__name__}.pt?
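One way to check this is to list the files that match the naming pattern above. The sketch below assumes the models/trained_models folder sits under a given root directory. The function name and the ExampleModel file names are hypothetical, and the demo runs against a temporary directory so it does not touch a real checkout.

```python
import tempfile
from pathlib import Path


def find_checkpoints(root, dataset="NL"):
    """Return the names of saved model files for one dataset, sorted by name."""
    folder = Path(root) / "models" / "trained_models"
    return sorted(p.name for p in folder.glob(f"wind_model_{dataset}_*h_*.pt"))


# Demo against a throwaway directory with made-up checkpoint names.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "models" / "trained_models"
    folder.mkdir(parents=True)
    (folder / "wind_model_NL_1h_ExampleModel.pt").touch()
    (folder / "wind_model_NL_2h_ExampleModel.pt").touch()
    found = find_checkpoints(tmp)
    missing = find_checkpoints(tmp, dataset="DK")

print(found, missing)
```

Pointing find_checkpoints at the repository root would list whichever models were written before the crash.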

poemon commented 1 year ago

Yes, there are two files in the trained_models folder

wind_model_NL_1h_CNN2DWind_NL.pt  445kb
wind_model_NL_1h_CNN2DAttWind_NL.pt  453kb

HansBambel commented 1 year ago

Can you try to start the training for 2 steps ahead? So remove the 1 from here: https://github.com/HansBambel/multidim_conv/blob/1caae1fd1fc710084df1438aa1a76fd35c56cd3b/train.py#L401-L407

It seems like the problem occurs then.

poemon commented 1 year ago

I'm sorry for replying to you so late. I've been a bit busy these past couple of days. I followed what you said and removed 1, but the result is still the same. I'm about to give up.

for t in [2, 3, 4]: 

Have you tried running your code?

HansBambel commented 1 year ago

I was just able to reproduce the problem. It was indeed in this line q *= dkh ** -0.5.

I fixed it by renaming q to q_new:

def compute_flat_qkv(self, x, dk, dv, Nh):
    qkv = self.qkv_conv(x)
    N, _, H, W = qkv.size()
    q, k, v = torch.split(qkv, [dk, dk, dv], dim=1)
    q = self.split_heads_2d(q, Nh)
    k = self.split_heads_2d(k, Nh)
    v = self.split_heads_2d(v, Nh)

    dkh = dk // Nh
    q_new = q * dkh ** -0.5
    flat_q = torch.reshape(q, (N, Nh, dk // Nh, H * W))
    flat_k = torch.reshape(k, (N, Nh, dk // Nh, H * W))
    flat_v = torch.reshape(v, (N, Nh, dv // Nh, H * W))
    return flat_q, flat_k, flat_v, q_new, k, v

When you get the latest from master it should work. Note that when executing train.py 5 models get trained at each time step for each dataset.

Furthermore, I activated some more prints to show progress.
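One detail that may be worth double-checking in the snippet above: flat_q is reshaped from q rather than q_new, whereas the original in-place line scaled q before the reshape. The thread does not say whether that difference is intended. The following standalone sketch keeps the scaling in both outputs. It is a free function rather than a method, it takes the qkv tensor directly instead of calling self.qkv_conv, and split_heads_2d is an assumed stand-in for the repository's helper, so treat it as an illustration only.

```python
import torch


def split_heads_2d(x, Nh):
    # Assumed helper: reshape (N, C, H, W) into (N, Nh, C // Nh, H, W).
    N, C, H, W = x.size()
    return torch.reshape(x, (N, Nh, C // Nh, H, W))


def compute_flat_qkv(qkv, dk, dv, Nh):
    N, _, H, W = qkv.size()
    q, k, v = torch.split(qkv, [dk, dk, dv], dim=1)
    q = split_heads_2d(q, Nh)
    k = split_heads_2d(k, Nh)
    v = split_heads_2d(v, Nh)

    dkh = dk // Nh
    # Out-of-place scaling, reused for both the flat and the unflattened query.
    q_scaled = q * dkh ** -0.5
    flat_q = torch.reshape(q_scaled, (N, Nh, dk // Nh, H * W))
    flat_k = torch.reshape(k, (N, Nh, dk // Nh, H * W))
    flat_v = torch.reshape(v, (N, Nh, dv // Nh, H * W))
    return flat_q, flat_k, flat_v, q_scaled, k, v


# Toy input: 4 query, 4 key and 4 value channels on a 2x2 grid, one head.
qkv = torch.ones(1, 12, 2, 2, requires_grad=True) * 1.0
outputs = compute_flat_qkv(qkv, dk=4, dv=4, Nh=1)
print(outputs[0].shape, outputs[3].shape)
```

With dk=4 and Nh=1 the scale is 4 ** -0.5 = 0.5, so both the flat and the unflattened query come out scaled by the same factor.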

HansBambel commented 1 year ago

@poemon Did this fix your issue?