NeuroBench / neurobench

Benchmark harness and baseline results for the NeuroBench algorithm track.
https://neurobench.readthedocs.io
Apache License 2.0

Different R2 benchmark result compared to TorchMetrics #230

Closed: morenzoe closed this issue 1 month ago

morenzoe commented 1 month ago

Hi, I would like to ask about the benchmarks.workload_metrics.r2 calculation. Is the formula it uses different from the one in TorchMetrics' R2Score? I get different results from the two functions.
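For reference, the definition I assume both implementations target is the R2 computed per output dimension and then averaged uniformly (multioutput='uniform_average'). A minimal sketch of that definition:

import torch

def r2_uniform_average(preds: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # preds, labels: [num_samples, num_outputs]
    ss_res = ((labels - preds) ** 2).sum(dim=0)               # residual sum of squares per output
    ss_tot = ((labels - labels.mean(dim=0)) ** 2).sum(dim=0)  # total sum of squares per output
    r2_per_output = 1.0 - ss_res / ss_tot
    return r2_per_output.mean()                               # uniform average over the outputs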

This is my code:

import torch #2.3.0
from neurobench.datasets import PrimateReaching #1.0.5
from neurobench.models.torch_model import TorchModel
from neurobench.benchmarks import Benchmark
from torch.utils.data import DataLoader, Subset
from torchmetrics.functional import r2_score
import random
import numpy as np

torch.manual_seed(42)
random.seed(42)
np.random.seed(42)

def seed_worker(worker_id):
    np.random.seed(42)
    random.seed(42)

g = torch.Generator()
g.manual_seed(42)

class LSTMCELL(torch.nn.Module):
    def __init__(self, hidden):
        super().__init__()

        self.hidden = hidden

        self.lstm_cell = torch.nn.LSTMCell(96, self.hidden)  # [channel, hidden_size]
        self.fc = torch.nn.Linear(self.hidden, out_features=2)

    def forward(self, x):
        # x [batch_size, num_steps, channel]
        hx = torch.randn(x.shape[0], self.hidden)  # [batch_size, hidden_size]
        cx = torch.randn(x.shape[0], self.hidden)

        for i in range(x.shape[1]):
            hx, cx = self.lstm_cell(x[:, i, :], (hx, cx))

        x = self.fc(hx)

        return x

filename = "indy_20160622_01"
data_dir = ".../neurobench/data"

dataset = PrimateReaching(file_path=data_dir, filename=filename,
                        num_steps=7, train_ratio=0.5, bin_width=0.004, label_series=False,
                        biological_delay=0, remove_segments_inactive=False, download=False)

# testing with one batch only
dataloader = DataLoader(Subset(dataset, dataset.ind_train[0:1024]), batch_size=1024, shuffle=False, worker_init_fn=seed_worker, generator=g)

model = LSTMCELL(256)

model = TorchModel(model)

with torch.no_grad():
    for samples, labels in iter(dataloader):
        preds = model(samples)
        r2 = r2_score(preds, labels, multioutput='uniform_average')
        # print(mem)
        print(r2)

static_metrics = []
workload_metrics = ["r2"]

benchmark = Benchmark(model, dataloader, [], [], [static_metrics, workload_metrics])
results = benchmark.run()
print(results)

These are the results I got:

Loading indy_20160622_01.mat
tensor(-4811.6987)
Running benchmark
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.29it/s] 
{'r2': -76530.734375}

Is this intended? Is there anything wrong with how I calculate R2 using TorchMetrics? Thank you in advance!

jasonlyik commented 1 month ago

Can you look into this deeper? When tracing the preds passed into the torchmetrics r2_score and the preds sent into the neurobench r2, they differ. See the attached images:

torchmetrics: [screenshot]

neurobench call: [screenshot]

jasonlyik commented 1 month ago

And here is the code for the neurobench r2 score calculation; please check whether it differs from the torchmetrics implementation: https://github.com/NeuroBench/neurobench/blob/main/neurobench/benchmarks/workload_metrics.py#L331

morenzoe commented 1 month ago

Can you look into this deeper? When tracing the preds passed into the torchmetrics r2_score and the preds sent into the neurobench r2, they differ. See the attached images:

torchmetrics: [screenshot]

neurobench call: [screenshot]

Ah, you're right! Some change seems to be made to the model by running an inference before the benchmark. This keeps happening even though I have used torch.no_grad(), model.eval(), and even saving and reloading the model. The easiest thing left to do is to comment out each part and test them separately, since the r2 score results from both functions are then reproducible, as seen below (I hope you don't mind the full screenshots; they should make things clearer).
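My current guess at the mechanism (just a sketch, not verified against the benchmark internals): forward() initializes hx and cx with torch.randn, so any inference run before benchmark.run() advances the global RNG and changes the hidden-state initialization the benchmark sees; torch.no_grad() and model.eval() do not touch the RNG state. A minimal illustration:

import torch

torch.manual_seed(42)
run_alone = torch.randn(4)            # hidden-state init the benchmark would sample if run first

torch.manual_seed(42)
_ = torch.randn(4)                    # an extra inference beforehand consumes RNG state
run_after_inference = torch.randn(4)  # the benchmark now samples different hidden states

print(torch.equal(run_alone, run_after_inference))  # False

If that is the cause, re-seeding with torch.manual_seed(42) immediately before each measurement (or initializing hx/cx with zeros instead of torch.randn) should make both runs see identical preds.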

torchmetrics r2_score reproducibility testing: [screenshot]

neurobench r2 reproducibility testing: [screenshot] NOTE: yes, the neurobench r2 score here is different from the one I posted before; this one is not affected by another inference.

torchmetrics r2_score preds: [screenshot]

torchmetrics r2_score result: [screenshot] NOTE: the result is the same as in the reproducibility testing.

neurobench r2 preds: [screenshot] NOTE: the preds here are the same as for torchmetrics r2_score.

neurobench r2 result: [screenshot] NOTE: the result is the same as in the reproducibility testing, but different from torchmetrics r2_score.

The NeuroBench and TorchMetrics r2 score results are still different even though the preds passed into them are the same. Is there anything wrong with my implementation? Thank you for your help!
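P.S. In case it helps, one way to guarantee that both metrics see identical inputs is to dump the tensors once and reload them on both sides (a sketch; the file names are arbitrary):

torch.save(preds, 'preds.pt')
torch.save(labels, 'labels.pt')

# later, feed the exact same tensors to both metric calls
preds = torch.load('preds.pt')
labels = torch.load('labels.pt')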

morenzoe commented 1 month ago

And here is the code for the neurobench r2 score calculation; please check whether it differs from the torchmetrics implementation: https://github.com/NeuroBench/neurobench/blob/main/neurobench/benchmarks/workload_metrics.py#L331

I have tried to look into the neurobench r2 score source code. At a quick glance the implementation is clearly different, but I cannot tell whether the discrepancy comes from the mathematics or just from the programming (different functions or a different calculation order) between neurobench and torchmetrics. Here is the code for the torchmetrics r2_score; any help in comparing the two would be much appreciated!

https://github.com/Lightning-AI/torchmetrics/blob/v1.4.0.post0/src/torchmetrics/regression/r2.py#L28-L183
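If it helps narrow things down, I could also look at the per-output values instead of the uniform average (assuming preds and labels are the tensors from my snippet above):

from torchmetrics.functional import r2_score

# per-output R2 from torchmetrics, without averaging over the x/y dimensions
print(r2_score(preds, labels, multioutput='raw_values'))
print(r2_score(preds, labels, multioutput='uniform_average'))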

jasonlyik commented 1 month ago

This appears to be a bug in torchmetrics, and I do believe our calculation is correct.

I saved the preds and labels tensors from your example and verified your numbers. Then I calculated R2 by hand and found the same result as the NeuroBench metric.

The bug appears to be that, for some reason, torchmetrics only computes the score along the x dimension (0th dim), while its result for the y dimension (1st dim) comes out as zero. Please see the code below and try to reproduce.

from neurobench.models import SNNTorchModel
from neurobench.benchmarks.workload_metrics import r2
import torch.nn as nn
import torch
from torchmetrics.functional import r2_score

'''
preds:
tensor([[-0.0121, -0.0484],
        [-0.0223, -0.0170],
        [ 0.0132, -0.0375],
        ...,
        [-0.0328, -0.0567],
        [-0.0332, -0.0492],
        [-0.0368, -0.0483]])
(Pdb) preds.shape
torch.Size([1024, 2])
(Pdb) preds.sum()
tensor(-83.1965)
(Pdb) preds.mean()
tensor(-0.0406)
'''
preds = torch.load('preds.pt')

'''
labels:
tensor([[ 1.9836e-04, -2.0599e-04],
        [ 1.8311e-04, -2.0218e-04],
        [ 1.5068e-04, -1.8692e-04],
        ...,
        [-1.4877e-04,  2.6703e-05],
        [-5.7220e-05,  0.0000e+00],
        [ 1.1444e-05, -1.9073e-05]])
(Pdb) labels.shape
torch.Size([1024, 2])
(Pdb) labels.sum()
tensor(-0.0336)
(Pdb) labels.mean()
tensor(-1.6404e-05)
'''
labels = torch.load('labels.pt')

'''
Output from torchmetrics, raw_values:
tensor([-9623.3994,     0.0000]) ????
'''
torch_metric = r2_score(preds, labels, multioutput='uniform_average') 
print("Torch metric: ", torch_metric.item()) # -4811.6997

data = (torch.zeros(1), labels)
dummy_net = nn.Module()
model = SNNTorchModel(dummy_net)
R2 = r2()
neurobench_metric = R2(model, preds, data) 
print("NeuroBench metric: ", neurobench_metric) # -76350.5390625

'''
r2_score = 1 - sum((labels - preds)^2) / sum((labels - mean(labels))^2)
'''
x_preds = preds[:, 0]
y_preds = preds[:, 1]
x_labels = labels[:, 0]
y_labels = labels[:, 1]

x_mean = x_labels.mean()
y_mean = y_labels.mean()

x_num = ((x_labels - x_preds)**2).sum()
x_den = ((x_labels - x_mean)**2).sum()

y_num = ((y_labels - y_preds)**2).sum()
y_den = ((y_labels - y_mean)**2).sum()

r2_x = 1 - x_num / x_den # tensor(-9623.3994)
r2_y = 1 - y_num / y_den # tensor(-143077.6562)

r2_manual = (r2_x + r2_y) / 2 
print("Manual metric: ", r2_manual.item()) # -76350.5312

jasonlyik commented 1 month ago

@morenzoe Closing this issue; feel free to reopen if you have any more comments.