Hi @jdgh000 ,
I tried reproducing the error in Chapter 8's notebook but it didn't raise the error you reported.
Could you please paste the whole content of the ch8-p134-full-classificiation-model.py file so I can try to pinpoint what happened?
Best, Daniel
I looked at the .ipynb files and they appear the same. What I did was make some enhancements and literally move a few lines of code around; the one in ch8 should work. Attachment: p134.tar.gz. What is your environment? I have CentOS 9 Stream, with Docker running CUDA 12.3 + torch (cu121). This config works with a lot of other torch/ML code.
[nonroot@localhost code-exercises]$ nvidia-smi
Mon Jun 24 19:35:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070 ...    Off | 00000000:01:00.0  On |                  N/A |
| 41%   34C    P8               8W / 215W |    336MiB /  8192MiB |     13%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|    0   N/A  N/A      1882      G   /usr/libexec/Xorg                            97MiB |
|    0   N/A  N/A      1957      G   /usr/bin/gnome-shell                         63MiB |
|    0   N/A  N/A      2575      G   /usr/lib64/firefox/firefox                  171MiB |
+---------------------------------------------------------------------------------------+
[nonroot@localhost code-exercises]$ cat /etc/os-release
NAME="CentOS Stream"
VERSION="9"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="9"
PLATFORM_ID="platform:el9"
PRETTY_NAME="CentOS Stream 9"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:centos:centos:9"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"
[nonroot@localhost code-exercises]$ uname -r
5.14.0-412.el9.x86_64
[nonroot@localhost code-exercises]$ pip3 list | egrep torch
pytorch-triton 3.0.0+989adb9a29
torch 2.4.0.dev20240412+cu121
torchaudio 2.2.0.dev20240412+cu121
torchvision 0.19.0.dev20240412+cu121
Hi @jdgh000 ,
Thank you for providing your code.
I figured out the issue: it is in the SquareModel class, which I pasted below with two corrected lines. The error was triggered because the forward() method was missing its self argument. Once that was fixed, it raised a different error, because the linear layer was assigned to self.classifiers in the constructor but referred to as self.classifier in the forward() method. Fixing these two small issues lets training proceed as expected.
Regarding the environment, I always test my code in Colab (it serves as a "standard" environment, since I trust Google to keep it stable).
Best, Daniel
import torch.nn as nn

class SquareModel(nn.Module):
    def __init__(self, n_features, hidden_dim, n_outputs):
        super(SquareModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.n_features = n_features
        self.n_outputs = n_outputs
        self.hidden = None
        # Simple RNN
        self.basic_rnn = nn.RNN(self.n_features, self.hidden_dim, batch_first=True)
        # classifier to produce as many logits as outputs
        # Original
        # self.classifiers = nn.Linear(self.hidden_dim, self.n_outputs)
        # Fixed
        self.classifier = nn.Linear(self.hidden_dim, self.n_outputs)

    # Original
    # def forward(X):
    # Fixed
    def forward(self, X):
        # X is batch first (N, L, F)
        # output is (N, L, H)
        # final hidden state is (1, N, H)
        batch_first_output, self.hidden = self.basic_rnn(X)
        # only last item in sequence (N, 1, H)
        last_output = batch_first_output[:, -1]
        # classifier will output (N, 1, n_outputs)
        out = self.classifier(last_output)
        # final output is (N, n_outputs)
        return out.view(-1, self.n_outputs)
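As a quick sanity check (a minimal sketch; the dimensions below are made-up values, not taken from the book), you can confirm the shape flow with a dummy batch:

import torch

model = SquareModel(n_features=2, hidden_dim=16, n_outputs=1)
dummy_x = torch.randn(3, 4, 2)   # (N, L, F): batch of 3 sequences, length 4, 2 features
out = model(dummy_x)             # exercises the fixed forward(self, X)
print(out.shape)                 # torch.Size([3, 1]), i.e. (N, n_outputs)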
thx!! it worked. good catch!
Imported StepByStep from v4.py and ran:

sbs_rnn = StepByStep(model, loss, optimizer)
sbs_rnn.set_loaders(train_loader, test_loader)
sbs_rnn.train(100)   <----

This caused the following failure:

/home/nonroot/sbs//ch8/ch8-p134-full-classificiation-model.py:25: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:276.)
  train_data = TensorDataset(torch.as_tensor(points).float(), torch.as_tensor(directions).view(-1,1).float())
Traceback (most recent call last):
  File "/home/nonroot/sbs//ch8/ch8-p134-full-classificiation-model.py", line 38, in <module>
    sbs_rnn.train(100)
  File "/home/nonroot/sbs//ch8/../stepbystep/v4.py", line 186, in train
    loss = self._mini_batch(validation=False)
  File "/home/nonroot/sbs//ch8/../stepbystep/v4.py", line 155, in _mini_batch
    mini_batch_loss = step_fn(x_batch, y_batch)
  File "/home/nonroot/sbs//ch8/../stepbystep/v4.py", line 99, in perform_train_step_fn
    yhat = self.model(x)
  File "/home/guyen/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/guyen/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward() takes 1 positional argument but 2 were given
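The UserWarning in the log above is unrelated to the TypeError, but it is easy to silence: build a single numpy.ndarray first, exactly as the message suggests. A minimal sketch (points and directions stand in for the lists built earlier in the script):

import numpy as np
import torch
from torch.utils.data import TensorDataset

# Converting the Python lists to single ndarrays avoids the slow list-of-arrays path
train_data = TensorDataset(
    torch.as_tensor(np.array(points)).float(),
    torch.as_tensor(np.array(directions)).view(-1, 1).float()
)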