RuntimeError: output with shape [256] doesn't match the broadcast shape [256, 256]

ajay-vikram commented 3 months ago

I have trained a recurrent network using an LSTMCell and MLP layers. When I load the model and weights to run the benchmark, I get "RuntimeError: output with shape [256] doesn't match the broadcast shape [256, 256]". Tracing it back, the error originates in utils.py at line 291 (out += biases). Printing the shapes of out and biases gives [256] and [256, 1] respectively. Squeezing the second dimension out of biases resolves the error, but I am unsure whether the mistake is in the benchmark code or in how my model is defined. I hit a similar issue when using a GRUCell. Can I please get some help?
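
For reference, the error message can be reproduced outside the harness with two tensors of the reported shapes. This is only a minimal sketch of the PyTorch broadcasting behaviour; the variable names mirror the ones printed above and it is not the code from utils.py.

```python
import torch

out = torch.zeros(256)        # shape [256], as reported for `out`
biases = torch.zeros(256, 1)  # shape [256, 1], as reported for `biases`

# Out-of-place addition broadcasts both operands to [256, 256].
broadcast_shape = tuple((out + biases).shape)

# In-place addition cannot resize `out`, so PyTorch raises instead.
try:
    out += biases
    error_message = None
except RuntimeError as err:
    error_message = str(err)

# Squeezing the trailing dimension makes the shapes match again.
out += biases.squeeze(1)
squeezed_shape = tuple(out.shape)

print(broadcast_shape, error_message, squeezed_shape)
```

The squeeze only removes the symptom. It does not explain why `out` arrived with one dimension fewer than the harness expects.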

jasonlyik commented 3 months ago

Hi Ajay, you may be running with the data shaped differently. We expect that the out tensor is shaped [4*hidden_state, batch_size], so I would expect that out should be shaped [256, 1] and not [256].
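
As a rough illustration of where the [4*hidden_state, batch_size] layout comes from, the sketch below inspects an `nn.LSTMCell` with the sizes that appear later in this thread (input 96, hidden 64). The matrix product is an assumption about how such a quantity could be formed, not the harness implementation.

```python
import torch
import torch.nn as nn

input_dim, hidden_size, batch_size = 96, 64, 1
cell = nn.LSTMCell(input_dim, hidden_size)

# The four gates are stacked, so the leading dimension is 4 * hidden_size.
weight_shape = tuple(cell.weight_ih.shape)
bias_shape = tuple(cell.bias_ih.shape)

with torch.no_grad():
    x = torch.zeros(batch_size, input_dim)
    # Gate pre-activations laid out as [4 * hidden_size, batch_size].
    out = cell.weight_ih @ x.T
    out_shape = tuple(out.shape)
    # A bias reshaped to [256, 1] adds in place to a [256, 1] tensor.
    out += cell.bias_ih.unsqueeze(1)

print(weight_shape, bias_shape, out_shape)
```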

At benchmark.py:125 (batch_results[m] = self.workload_metrics[m](self.model, preds, data)), can you please check the shape of preds and data? Otherwise, it may be an issue with the hook connected to the RNNCell which tracks inputs.
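
One way to see what a hook on the cell receives is to attach a plain PyTorch forward hook and record its inputs. This is a sketch using the standard `register_forward_hook` API; the hook name `record_inputs` is made up for illustration and the NeuroBench hook may be structured differently.

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(96, 64)
captured = []

def record_inputs(module, inputs, output):
    # `inputs` is the tuple of positional arguments passed to the module call.
    captured.append(inputs)

handle = cell.register_forward_hook(record_inputs)

x = torch.zeros(1, 96)
h = torch.zeros(1, 64)
c = torch.zeros(1, 64)

cell(x)          # state omitted
cell(x, (h, c))  # state passed explicitly
handle.remove()

lengths = [len(inputs) for inputs in captured]
print(lengths)
```

When the state is omitted, the hook sees only the input tensor. Code that assumes a second element holding (h, c) would need to handle that case.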

Also, there is the LSTM example for a different sequence task here which may be helpful.

ajay-vikram commented 3 months ago

Hi Jason, the shapes of preds and data are [256, 2] and ([256, 1, 96], [256, 2]) respectively, where data is a tuple. These are the inputs to my model as well. What shape do you expect as input to the LSTMCell? In my case, a [1, 96] tensor goes to the LSTMCell. It comes from acc_spikes in the buffering mechanism of the forward pass, similar to the one in primate_example.

jasonlyik commented 3 months ago

The shape looks reasonable to me. Can you check whether your code matches the code block from previous issue #225? That works with the latest neurobench package, 1.0.6, and with any arbitrary batch size. If there are still issues, please post your code block so we can inspect the error.

ajay-vikram commented 3 months ago

Ohh, I see. I didn't get the latest version. How do I get it? Do I run .bumpversion.toml?

jasonlyik commented 3 months ago

pip install --upgrade neurobench

Or, if you are using poetry and a locally cloned repo, simply git pull on the main branch.
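
To confirm which version is actually installed in the active environment, a small check with the standard library can help. The helper name `installed_version` is hypothetical; `importlib.metadata` is part of Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

# Prints the installed neurobench version, or None if it is not installed.
print(installed_version("neurobench"))
```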

ajay-vikram commented 3 months ago

Still getting the same issue. Can you tell me which code has been modified? I'll check whether the changes have been picked up.

jasonlyik commented 3 months ago

Changes are listed in #227

Please check if you can successfully run the minimal example from the code block in #225

If there is still an issue, please provide a minimal example of the model definition and harness call which causes the issue.

ajay-vikram commented 3 months ago

Yes, the minimal example code runs.

Here's my model definition:

import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_dim):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.output_dim = 2

        self.lstm = nn.LSTMCell(self.input_dim, 64)
        self.fc1 =  nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, self.output_dim)
        self.layernorm0 = nn.LayerNorm(self.input_dim)
        self.layernorm1 = nn.LayerNorm(32)
        self.layernorm2 = nn.LayerNorm(16)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

        self.bin_window_time = 0.2
        self.sampling_rate = 0.004
        self.bin_window_size = int(self.bin_window_time / self.sampling_rate)
        self.register_buffer("data_buffer", torch.zeros(1, self.input_dim).type(torch.float32), persistent=False)

    def single_forward(self,x):
        x = x.unsqueeze(0)
        x = self.layernorm0(x)
        (hn, cn) = self.lstm(x)
        out = self.relu(hn)
        out = self.layernorm1(self.relu(self.fc1(out)))
        out = self.dropout(out)
        out = self.layernorm2(self.relu(self.fc2(out)))
        out = self.fc3(out)
        return out

    def forward(self, x):
        predictions = []

        seq_length = x.shape[0]
        for seq in range(seq_length):
            current_seq = x[seq, :, :]
            self.data_buffer = torch.cat((self.data_buffer, current_seq), dim=0)
            if self.data_buffer.shape[0] <= self.bin_window_size:
                predictions.append(torch.zeros(1, self.output_dim).to(x.device))
            else:
                # Only pass input into model when the buffer size == bin_window_size
                if self.data_buffer.shape[0] > self.bin_window_size:
                    self.data_buffer = self.data_buffer[1:, :]

                # Accumulate
                spikes = self.data_buffer.clone()
                acc_spikes = torch.sum(spikes, dim=0)
                pred = self.single_forward(acc_spikes)
                predictions.append(pred)

        predictions = torch.stack(predictions).squeeze(dim=1)

        return predictions

ajay-vikram commented 3 months ago

This is the benchmark code:

import torch
from torch.utils.data import DataLoader, Subset

from neurobench.datasets import PrimateReaching
from neurobench.models.torch_model import TorchModel
from neurobench.benchmarks import Benchmark

from ANN import ANNModel2D
from GRU import GRU
from LSTM import LSTM

all_files = ["indy_20160622_01"]
# all_files = ["indy_20160622_01", "indy_20160630_01", "indy_20170131_02", 
#              "loco_20170210_03", "loco_20170215_02", "loco_20170301_05"]

footprint = []
connection_sparsity = []
activation_sparsity = []
dense = []
macs = []
acs = []
r2 = []

device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

for filename in all_files:
    print("Processing {}".format(filename))

    # The dataloader and preprocessor have been combined into a single class
    data_dir = "/home/satyapreets/Ajay/neurobench/neurobench/data" # data in repo root dir
    dataset = PrimateReaching(file_path=data_dir, filename=filename,
                            num_steps=1, train_ratio=0.5, bin_width=0.004,
                            biological_delay=0, remove_segments_inactive=False)

    test_set_loader = DataLoader(Subset(dataset, dataset.ind_test), batch_size=256, shuffle=False)

    net = LSTM(input_dim=dataset.input_feature_size)
    # net = ANNModel2D(input_dim=dataset.input_feature_size, layer1=32, layer2=48, 
    #                  output_dim=2, bin_window=0.2, drop_rate=0.5)

    net.load_state_dict(torch.load("/home/satyapreets/Ajay/neurobench/mobilenet_training/experiments/vww/submission/lstm_64_indy_20160622_01.pt", map_location=device)['state_dict'])
    # net.load_state_dict(torch.load("./model_data/2D_ANN_Weight/"+filename+"_model_state_dict.pth", map_location=device))

    model = TorchModel(net)

    static_metrics = ["footprint", "connection_sparsity"]
    workload_metrics = ["r2", "activation_sparsity", "synaptic_operations"]

    # Benchmark expects the following:
    benchmark = Benchmark(model, test_set_loader, [], [], [static_metrics, workload_metrics])
    results = benchmark.run(device=device)
    print(results)

    footprint.append(results['footprint'])
    connection_sparsity.append(results['connection_sparsity'])
    activation_sparsity.append(results['activation_sparsity'])
    dense.append(results['synaptic_operations']['Dense'])
    macs.append(results['synaptic_operations']['Effective_MACs'])
    acs.append(results['synaptic_operations']['Effective_ACs'])
    r2.append(results['r2'])

print("Footprint: {}".format(footprint))
print("Connection sparsity: {}".format(connection_sparsity))
print("Activation sparsity: {}".format(activation_sparsity), sum(activation_sparsity)/len(activation_sparsity))
print("Dense: {}".format(dense), sum(dense)/len(dense))
print("MACs: {}".format(macs), sum(macs)/len(macs))
print("ACs: {}".format(acs), sum(acs)/len(acs))
print("R2: {}".format(r2), sum(r2)/len(r2))

# Footprint: [20824, 20824, 20824, 33496, 33496, 33496]
# Connection sparsity: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# Activation sparsity: [0.7068512007122443, 0.7274494314849341, 0.6142621034584272, 0.6290474755671983, 0.6793054885963405, 0.6963649652600741] 0.6755467775132032
# Dense: [4702.261627687736, 4701.8430499148435, 4699.549582947173, 7773.2197567257945, 7771.01773105288, 7772.632844051291] 6236.754098729952
# MACs: [4306.322415210456, 3595.209672287623, 3607.261044176707, 5851.9819915795315, 5995.014802029395, 6462.786839756449] 4969.76279417336
# ACs: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.0
# R2: [0.6327020525932312, 0.5241347551345825, 0.6216747164726257, 0.5727078914642334, 0.4745999276638031, 0.6272222995758057] 0.5755069404840469

jasonlyik commented 3 months ago

Hi Ajay, I noticed that your LSTMCell forward call does not include the (h, c) in the inputs. Based on the documentation, if these are not included, I believe that the recurrent state of the LSTM is not tracked at all, and essentially the LSTM block is just an MLP-type transform. I may be wrong on this, though.

Regardless, note that all of our other LSTM examples use the forward convention for the LSTMCell hx, cx = rnn(input[i], (hx, cx)), and not just hx, cx = rnn(input[i]).
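
The difference between the two calling conventions can be checked directly. The sketch below uses only `nn.LSTMCell` with the sizes from this thread; the random input is a stand-in for real data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(96, 64)
x = torch.randn(1, 96)
zeros = torch.zeros(1, 64)

with torch.no_grad():
    # Omitting the state is equivalent to passing zeros on that call.
    h_default, c_default = cell(x)
    h_zero, c_zero = cell(x, (zeros, zeros))
    same_as_zero_state = torch.allclose(h_default, h_zero)

    # Without threading the state, a repeated call gives an identical result.
    h_again, _ = cell(x)
    stateless_repeat = torch.allclose(h_default, h_again)

    # Feeding the previous state back in changes the next output.
    h_next, c_next = cell(x, (h_default, c_default))
    stateful_differs = not torch.allclose(h_default, h_next)

print(same_as_zero_state, stateless_repeat, stateful_differs)
```

So the default zero state applies on every call, which means no information is carried from one step to the next unless (hx, cx) is passed back in.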

With the additions to your model definition shown in the code block below, the harness no longer raises a runtime error:

import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_dim):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.output_dim = 2

        self.lstm = nn.LSTMCell(self.input_dim, 64)
        self.fc1 =  nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, self.output_dim)
        self.layernorm0 = nn.LayerNorm(self.input_dim)
        self.layernorm1 = nn.LayerNorm(32)
        self.layernorm2 = nn.LayerNorm(16)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

        self.bin_window_time = 0.2
        self.sampling_rate = 0.004
        self.bin_window_size = int(self.bin_window_time / self.sampling_rate)
        self.register_buffer("data_buffer", torch.zeros(1, self.input_dim).type(torch.float32), persistent=False)

        self.h = None
        self.c = None

    def single_forward(self,x):
        x = x.unsqueeze(0)
        x = self.layernorm0(x)
        self.h, self.c = self.lstm(x, (self.h, self.c))
        out = self.relu(self.h)
        out = self.layernorm1(self.relu(self.fc1(out)))
        out = self.dropout(out)
        out = self.layernorm2(self.relu(self.fc2(out)))
        out = self.fc3(out)
        return out

    def forward(self, x):
        predictions = []

        self.h = torch.zeros(1, 64).to(x.device)
        self.c = torch.zeros(1, 64).to(x.device)

        seq_length = x.shape[0]
        for seq in range(seq_length):
            current_seq = x[seq, :, :]
            self.data_buffer = torch.cat((self.data_buffer, current_seq), dim=0)
            if self.data_buffer.shape[0] <= self.bin_window_size:
                predictions.append(torch.zeros(1, self.output_dim).to(x.device))
            else:
                # Only pass input into model when the buffer size == bin_window_size
                if self.data_buffer.shape[0] > self.bin_window_size:
                    self.data_buffer = self.data_buffer[1:, :]

                # Accumulate
                spikes = self.data_buffer.clone()
                acc_spikes = torch.sum(spikes, dim=0)
                pred = self.single_forward(acc_spikes)
                predictions.append(pred)

        predictions = torch.stack(predictions).squeeze(dim=1)

        return predictions

The harness should be able to support the case where (h, c) is not passed into the LSTMCell, so this is still an issue. But I recommend that you include (h, c) in the inputs.

ajay-vikram commented 3 months ago

Aah, I see. I read somewhere in the documentation that LSTMs initialize their hidden and cell states to tensors of zeros by default, which is why I didn't pass them explicitly. Thanks a lot!!

ajay-vikram commented 3 months ago

Also, will I have to retrain my models with these changes incorporated? I only changed the model definition and loaded the same weights I had before adding the explicit h and c, and the neurobench benchmarks run fine.
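
The old checkpoint still loads because h and c are plain attributes rather than parameters or buffers, so they never appear in the state dict. The sketch below shows this with two cut-down stand-in classes (`Stateless` and `Stateful` are hypothetical names, not the model above). It says nothing about whether accuracy is preserved.

```python
import torch
import torch.nn as nn

class Stateless(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTMCell(96, 64)
        self.fc = nn.Linear(64, 2)

class Stateful(Stateless):
    def __init__(self):
        super().__init__()
        # Plain attributes, not parameters or buffers, so not in state_dict.
        self.h = None
        self.c = None

old, new = Stateless(), Stateful()
same_keys = list(old.state_dict().keys()) == list(new.state_dict().keys())

# strict=True is the default, so any key mismatch would raise here.
result = new.load_state_dict(old.state_dict())
print(same_keys, result.missing_keys, result.unexpected_keys)
```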

jasonlyik commented 3 months ago

My guess is that you will need to retrain the model, as it is now tracking recurrent state and it wasn't before. I suggest that you take out all of the metrics except the R2 workload metric and first verify you are getting the expected accuracy before considering the compute complexity.

ajay-vikram commented 3 months ago

Alright thanks a lot!

jasonlyik commented 3 months ago

TODO: support synops for RNNCells which do not use recurrent input
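
One possible direction for this TODO, sketched under assumptions: a helper that unpacks the inputs a forward hook receives and substitutes a zero state when none was passed. The function name `unpack_lstm_cell_inputs` is hypothetical, it assumes batched 2D input, and it is not the NeuroBench implementation.

```python
import torch

def unpack_lstm_cell_inputs(inputs, hidden_size):
    """Hypothetical helper: return (x, h, c), using zero state when none was passed."""
    x = inputs[0]
    if len(inputs) > 1 and inputs[1] is not None:
        h, c = inputs[1]
    else:
        # Mirror nn.LSTMCell's default of a zero initial state (batched input assumed).
        h = x.new_zeros(x.shape[0], hidden_size)
        c = x.new_zeros(x.shape[0], hidden_size)
    return x, h, c

x = torch.ones(1, 96)
_, h_missing, c_missing = unpack_lstm_cell_inputs((x,), 64)

h_in, c_in = torch.ones(1, 64), torch.ones(1, 64)
_, h_given, c_given = unpack_lstm_cell_inputs((x, (h_in, c_in)), 64)
print(tuple(h_missing.shape), tuple(h_given.shape))
```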

NeuroBench / neurobench

RuntimeError: output with shape [256] doesn't match the broadcast shape [256, 256] #234