Open poemon opened 1 year ago
Hey!
Thanks for opening the issue. I will try to help you here.
installed the latest version of PyTorch
It could be that this is already the main issue. The requirement listed is pytorch 1.4.0, but it could also be that the earlier version was just masking a bug. So let's check further.
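A quick way to compare the installed PyTorch against the listed requirement is sketched below. The helper names `installed_version` and `matches_requirement` are hypothetical and not part of the repository. `importlib.metadata` is in the standard library from Python 3.8; older interpreters would need the `importlib-metadata` backport instead.

```python
from importlib import metadata  # stdlib from Python 3.8 onwards


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_requirement(package, required):
    """True if the installed version equals `required`, ignoring local tags such as +cu92."""
    found = installed_version(package)
    return found is not None and found.split("+")[0] == required


if __name__ == "__main__":
    print("torch installed:", installed_version("torch"))
    print("matches 1.4.0:", matches_requirement("torch", "1.4.0"))
```

Running this inside the environment used for training shows whether the interpreter actually sees the version you expect.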
I searched on Google and made some improvements to the line of code
q *= dkh ** -0.5
What improvements did you make? The error message suggests you can circumvent this by assigning the result to a new variable instead of q. Have you tried this?
Now it doesn't stop at 30% anymore, but it starts from 0 again
What do you mean by this? The training? I don't see how an error would reset the progress made up to that point.
Oh my goodness, you replied to me so quickly! Thank you very much
My current version of PyTorch is 2.1.0.
I changed the line q *= dkh ** -0.5
in attention_augmented_conv.py to
tmp_tensor = dkh ** -0.5
q = q * tmp_tensor
OR
q = q * dkh ** -0.5
Neither works.
I made this change based on a solution I found when I searched Google for this error:
RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
When I changed it like this, the error message changed to
NL dataset. Step: 1
Device: cuda
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 5, 5] 928
Linear-2 [-1, 128] 102,528
Linear-3 [-1, 64] 8,256
Linear-4 [-1, 7] 455
DoubleDense-5 [-1, 7] 0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs: 30%| | 50/150 [00:00<?, ?it/s]
Device: cuda
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 5, 5] 928
Linear-2 [-1, 128] 102,528
Linear-3 [-1, 64] 8,256
Linear-4 [-1, 7] 455
DoubleDense-5 [-1, 7] 0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs: 0%| | 0/150 [00:00<?, ?it/s]
You know what I mean? The code doesn't stop when it reaches 30%, it starts all over again.
I've also tried a combination of python3.8 + pytorch1.4
But this creates a new problem
Traceback (most recent call last):
File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 1, in <module>
import torch
File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\torch\__init__.py", line 44, in <module>
import numpy as _np
File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\numpy\__init__.py", line 125, in <module>
from numpy.__config__ import show as show_config
File "C:\Users\xinyi\.conda\envs\pyt38\lib\site-packages\numpy\__config__.py", line 12, in <module>
os.add_dll_directory(extra_dll_dir)
AttributeError: module 'os' has no attribute 'add_dll_directory'
So do you know what the problem is? Should I be looking for problems with the new version, or go back to the old one? Thanks again.
That is indeed weird. I do not see any reason why it should restart. Did you maybe also remove the print statements in train.py? The script trains 8 models in sequence: 4 timesteps for each of the NL and DK datasets (https://github.com/HansBambel/multidim_conv/blob/1caae1fd1fc710084df1438aa1a76fd35c56cd3b/train.py#L401-L407).
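To illustrate why the progress bar can appear to "start from 0 again", here is a minimal sketch of a script that trains several models one after another. The loop order and the helper names `planned_runs` and `simulate` are assumptions made for illustration; they are not taken from train.py.

```python
from itertools import product

DATASETS = ["NL", "DK"]          # the two datasets named in the thread
PREDICTION_STEPS = [1, 2, 3, 4]  # the four timesteps named in the thread


def planned_runs(datasets, steps):
    """List every (dataset, step) combination that gets its own training run."""
    return list(product(datasets, steps))


def simulate(datasets, steps, epochs=150):
    """Return the header lines a sequential training script might print."""
    lines = []
    for dataset, step in planned_runs(datasets, steps):
        lines.append(f"{dataset} dataset. Step: {step}")
        # Every run creates a fresh progress bar, so the counter is back at 0.
        lines.append(f"Epochs: 0/{epochs}")
    return lines


if __name__ == "__main__":
    for line in simulate(DATASETS, PREDICTION_STEPS):
        print(line)
```

Under these assumptions a new "Epochs: 0%" line marks the start of the next model rather than a restart of the previous one.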
I've also tried a combination of python3.8 + pytorch1.4
I think I used Python 3.7.
The error AttributeError: module 'os' has no attribute 'add_dll_directory' led me here, where it is said that add_dll_directory was only introduced in 3.8.
Maybe you can create a new env with 3.7 and try that again.
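A small diagnostic, run with the same interpreter that raises the error, can confirm which Python is actually in use and whether `os.add_dll_directory` is available. The function name `dll_directory_support` is hypothetical. `os.add_dll_directory` exists only on Windows and only from Python 3.8 onwards.

```python
import os
import sys


def dll_directory_support():
    """Report the interpreter version and whether os.add_dll_directory exists."""
    return {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,
        "is_windows": os.name == "nt",
        "has_add_dll_directory": hasattr(os, "add_dll_directory"),
    }


if __name__ == "__main__":
    for key, value in dll_directory_support().items():
        print(f"{key}: {value}")
```

If `has_add_dll_directory` is False on Windows, the interpreter is older than 3.8, and an installed package that calls this function was probably built for a newer Python.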
I have lowered the version as you instructed: I created a new virtual environment with Python 3.7. The versions of the other packages are as follows:
Package Version
----------------------- ----------
absl-py 2.0.0
cachetools 5.3.1
certifi 2022.12.7
charset-normalizer 3.3.0
colorama 0.4.6
einops 0.2.0
google-auth 2.23.3
google-auth-oauthlib 0.4.6
grpcio 1.59.0
idna 3.4
importlib-metadata 6.7.0
Markdown 3.4.4
MarkupSafe 2.1.3
numpy 1.21.6
oauthlib 3.2.2
olefile 0.46
Pillow 9.5.0
pip 22.3.1
protobuf 3.20.3
pyasn1 0.5.0
pyasn1-modules 0.3.0
requests 2.31.0
requests-oauthlib 1.3.1
rsa 4.9
scipy 1.7.3
setuptools 65.6.3
six 1.16.0
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
torch 1.4.0+cu92
torchsummary 1.5.1
torchvision 0.5.0+cu92
tqdm 4.66.1
typing_extensions 4.7.1
urllib3 2.0.6
Werkzeug 2.2.3
wheel 0.38.4
wincertstore 0.2
zipp 3.15.0
It reported an error at 52%.
NL dataset. Step: 1
Device: cuda
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 5, 5] 928
Linear-2 [-1, 128] 102,528
Linear-3 [-1, 64] 8,256
Linear-4 [-1, 7] 455
DoubleDense-5 [-1, 7] 0
================================================================
Total params: 112,167
Trainable params: 112,167
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 0.43
Estimated Total Size (MB): 0.44
----------------------------------------------------------------
Epochs: 52%|█████▏ | 78/150 [15:18<14:07, 11.78s/it]
Stopping early --> val_loss has not decreased over 20 epochs
Traceback (most recent call last):
File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 404, in <module>
train_wind_nl(folder+data, epochs=150, input_timesteps=6, prediction_timestep=t, dev=dev, earlystopping=20)
File "D:\python_project\multidim_conv-master\multidim_conv-master\train.py", line 307, in train_wind_nl
summary(model, (7, input_timesteps, 6), device="cpu")
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torchsummary\torchsummary.py", line 72, in summary
model(*x)
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "D:\python_project\multidim_conv-master\multidim_conv-master\models\wind_models.py", line 93, in forward
x = F.relu(self.conv1(x))
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\xinyi\.conda\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 52, in forward
flat_q, flat_k, flat_v, q, k, v = self.compute_flat_qkv(x, self.dk, self.dv, self.Nh)
File "D:\python_project\multidim_conv-master\multidim_conv-master\models\attention_augmented_conv.py", line 79, in compute_flat_qkv
q *= dkh ** -0.5
RuntimeError: Output 0 of ReshapeAliasBackward0 is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
What can I do to get your code to work correctly? Thank you very much!
I don't know why it is not working for you, but the crash is not coming from the training itself. From the message Stopping early --> val_loss has not decreased over 20 epochs we can see that the crash happened after training had finished.
Does the model get saved in models/trained_models/wind_model_NL_{prediction_timestep}h_{model.__class__.__name__}.pt?
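For readers unfamiliar with the early-stopping message, the following is a minimal sketch of patience-based early stopping. It is a generic illustration, assuming a patience of 20 as in the `earlystopping=20` argument from the traceback; the repository's own implementation may differ, and the loss values are synthetic.

```python
def early_stop_epoch(val_losses, patience=20):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


if __name__ == "__main__":
    # Synthetic history: the loss improves for 58 epochs, then stays flat.
    losses = [1.0 - 0.01 * i for i in range(58)] + [0.5] * 92
    print("stopped at epoch:", early_stop_epoch(losses))
```

With this synthetic history the loop ends well before the 150 configured epochs, which is why a progress bar can stop partway without any error having occurred during training.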
Yes, there are two files in the trained_models folder
wind_model_NL_1h_CNN2DWind_NL.pt 445kb
wind_model_NL_1h_CNN2DAttWind_NL.pt 453kb
Can you try to start the training for 2 steps ahead? So remove the 1 from here:
https://github.com/HansBambel/multidim_conv/blob/1caae1fd1fc710084df1438aa1a76fd35c56cd3b/train.py#L401-L407
It seems like the problem occurs then.
I'm sorry for replying to you so late. I've been a bit busy these past couple of days. I followed what you said and removed the 1, but the result is still the same. I'm about to give up.
for t in [2, 3, 4]:
Have you tried running your code?
I was just able to reproduce the problem. It was indeed in this line: q *= dkh ** -0.5.
I fixed it by renaming q to q_new:
def compute_flat_qkv(self, x, dk, dv, Nh):
    qkv = self.qkv_conv(x)
    N, _, H, W = qkv.size()
    q, k, v = torch.split(qkv, [dk, dk, dv], dim=1)
    q = self.split_heads_2d(q, Nh)
    k = self.split_heads_2d(k, Nh)
    v = self.split_heads_2d(v, Nh)
    dkh = dk // Nh
    q_new = q * dkh ** -0.5
    flat_q = torch.reshape(q, (N, Nh, dk // Nh, H * W))
    flat_k = torch.reshape(k, (N, Nh, dk // Nh, H * W))
    flat_v = torch.reshape(v, (N, Nh, dv // Nh, H * W))
    return flat_q, flat_k, flat_v, q_new, k, v
When you get the latest from master it should work. Note that when executing train.py, 5 models get trained at each time step for each dataset.
Furthermore, I activated some more prints to show progress.
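The difference between the in-place and the out-of-place scaling can be reproduced in isolation. The sketch below is a standalone illustration, not code from the repository: the tensor sizes and the function names `try_in_place_scale` and `out_of_place_scale` are made up, and a multiplication by 1.0 stands in for the convolution output. On recent PyTorch releases the in-place attempt is expected to raise the RuntimeError quoted in this thread, while older releases may accept it, so the sketch only reports what happens.

```python
import torch


def try_in_place_scale(dk, dkh):
    """Scale a view returned by torch.split in place.

    Returns the error message if PyTorch rejects the operation, else None.
    """
    x = torch.randn(1, 3 * dk, 4, 4, requires_grad=True)
    qkv = x * 1.0  # non-leaf tensor standing in for a conv output
    q, _, _ = torch.split(qkv, [dk, dk, dk], dim=1)  # several views of qkv
    try:
        q *= dkh ** -0.5
    except RuntimeError as err:
        return str(err)
    return None


def out_of_place_scale(dk, Nh):
    """Scale into a new tensor, reshape it, and backpropagate; returns x.grad."""
    dkh = dk // Nh
    x = torch.randn(1, 3 * dk, 4, 4, requires_grad=True)
    qkv = x * 1.0
    q, _, _ = torch.split(qkv, [dk, dk, dk], dim=1)
    q_scaled = q * dkh ** -0.5  # new tensor, the view itself is untouched
    flat_q = torch.reshape(q_scaled, (1, Nh, dkh, 4 * 4))
    flat_q.sum().backward()
    return x.grad


if __name__ == "__main__":
    print("in-place attempt:", try_in_place_scale(dk=8, dkh=4))
    grad = out_of_place_scale(dk=8, Nh=2)
    print("gradient on a q channel:", grad[0, 0, 0, 0].item())
```

Note that this sketch reshapes the scaled tensor. In the snippet above, flat_q is reshaped from q rather than q_new, so if the original in-place line scaled q before the reshape, it may be worth checking whether flat_q should be built from q_new to keep the earlier behaviour.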
@poemon Did this fix your issue?
Hello author, I created a new virtual environment in Anaconda and installed the latest version of PyTorch. However, when running the code, it stops and throws an error around 30%. The error code is shown above. I searched on Google and made some improvements to the line of code q *= dkh ** -0.5. Now it doesn't stop at 30% anymore, but it starts from 0 again. I spent a long time trying to figure it out but couldn't. I hope to get your answer on what I should do to run your code.