Hi @jdgh000 ,
I tried reproducing the error in Chapter 8's notebook but it didn't raise the error you reported.
Could you please paste the whole content of the ch8-p134-full-classificiation-model.py file so I can try to pinpoint what happened?
Best, Daniel
I looked at the .ipynb files and they appear the same. What I did was make some enhancements and literally move a few lines of code around; the one in ch8 should work. Attachment: p134.tar.gz. What is your environment? I have CentOS 9 Stream, with Docker running CUDA 12.3 + torch (cu121). This config works with a lot of other torch/ML code.
[nonroot@localhost code-exercises]$ nvidia-smi
Mon Jun 24 19:35:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070 ...    Off | 00000000:01:00.0  On |                  N/A |
| 41%   34C    P8               8W / 215W |    336MiB /  8192MiB |     13%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|    0   N/A  N/A      1882      G   /usr/libexec/Xorg                            97MiB |
|    0   N/A  N/A      1957      G   /usr/bin/gnome-shell                         63MiB |
|    0   N/A  N/A      2575      G   /usr/lib64/firefox/firefox                  171MiB |
+---------------------------------------------------------------------------------------+
[nonroot@localhost code-exercises]$ cat /etc/os-release
NAME="CentOS Stream"
VERSION="9"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="9"
PLATFORM_ID="platform:el9"
PRETTY_NAME="CentOS Stream 9"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:centos:centos:9"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"
[nonroot@localhost code-exercises]$ uname -r
5.14.0-412.el9.x86_64
[nonroot@localhost code-exercises]$ pip3 list | egrep torch
pytorch-triton 3.0.0+989adb9a29
torch 2.4.0.dev20240412+cu121
torchaudio 2.2.0.dev20240412+cu121
torchvision 0.19.0.dev20240412+cu121
Hi @jdgh000 ,
Thank you for providing your code.
I figured out the issue: it is in the SquareModel class, which I pasted below with two corrected lines. The error was triggered because the forward() method was missing its self argument. Once that was fixed, it raised a different error, because the linear layer was assigned to self.classifiers in the constructor but referred to as self.classifier in the forward() method. Fixing these two small issues lets training proceed as expected.
Regarding the environment, I always test my code in Colab (it serves as a "standard" environment, since I trust Google to keep it stable).
Best, Daniel
import torch.nn as nn

class SquareModel(nn.Module):
    def __init__(self, n_features, hidden_dim, n_outputs):
        super(SquareModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.n_features = n_features
        self.n_outputs = n_outputs
        self.hidden = None
        # Simple RNN
        self.basic_rnn = nn.RNN(self.n_features, self.hidden_dim, batch_first=True)
        # classifier to produce as many logits as outputs
        # Original
        # self.classifiers = nn.Linear(self.hidden_dim, self.n_outputs)
        # Fixed
        self.classifier = nn.Linear(self.hidden_dim, self.n_outputs)

    # Original
    # def forward(X):
    # Fixed
    def forward(self, X):
        # X is batch first (N, L, F)
        # output is (N, L, H)
        # final hidden state is (1, N, H)
        batch_first_output, self.hidden = self.basic_rnn(X)
        # only last item in sequence (N, 1, H)
        last_output = batch_first_output[:, -1]
        # classifier will output (N, 1, n_outputs)
        out = self.classifier(last_output)
        # final output is (N, n_outputs)
        return out.view(-1, self.n_outputs)
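As a quick sanity check (a minimal sketch; the dimensions below are made-up values, not taken from the book), you can confirm the shape flow with a dummy batch:

import torch

model = SquareModel(n_features=2, hidden_dim=16, n_outputs=1)
dummy_x = torch.randn(3, 4, 2)   # (N, L, F): batch of 3 sequences, length 4, 2 features
out = model(dummy_x)             # exercises the fixed forward(self, X)
print(out.shape)                 # torch.Size([3, 1]), i.e. (N, n_outputs)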
thx!! it worked. good catch!
Imported StepByStep from v4.py and ran:

sbs_rnn = StepByStep(model, loss, optimizer)
sbs_rnn.set_loaders(train_loader, test_loader)
sbs_rnn.train(100)   <----

This caused the following failure:

/home/nonroot/sbs//ch8/ch8-p134-full-classificiation-model.py:25: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:276.)
  train_data = TensorDataset(torch.as_tensor(points).float(), torch.as_tensor(directions).view(-1,1).float())
Traceback (most recent call last):
  File "/home/nonroot/sbs//ch8/ch8-p134-full-classificiation-model.py", line 38, in <module>
    sbs_rnn.train(100)
  File "/home/nonroot/sbs//ch8/../stepbystep/v4.py", line 186, in train
    loss = self._mini_batch(validation=False)
  File "/home/nonroot/sbs//ch8/../stepbystep/v4.py", line 155, in _mini_batch
    mini_batch_loss = step_fn(x_batch, y_batch)
  File "/home/nonroot/sbs//ch8/../stepbystep/v4.py", line 99, in perform_train_step_fn
    yhat = self.model(x)
  File "/home/guyen/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/guyen/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward() takes 1 positional argument but 2 were given
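The UserWarning in the log above is unrelated to the TypeError, but it is easy to silence: build a single numpy.ndarray first, exactly as the message suggests. A minimal sketch (points and directions stand in for the lists built earlier in the script):

import numpy as np
import torch
from torch.utils.data import TensorDataset

# Converting the Python lists to single ndarrays avoids the slow list-of-arrays path
train_data = TensorDataset(
    torch.as_tensor(np.array(points)).float(),
    torch.as_tensor(np.array(directions)).view(-1, 1).float()
)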