Closed: morenzoe closed this issue 4 months ago.
Can you look into this deeper? When tracing the model's predictions, the preds passed into the torchmetrics r2_score and the preds sent into the neurobench r2 metric differ. See the attached images:
torchmetrics call:
neurobench call:
And here is the code for the neurobench r2 score calculation; please see whether it differs from the torchmetrics implementation: https://github.com/NeuroBench/neurobench/blob/main/neurobench/benchmarks/workload_metrics.py#L331
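As a minimal sketch of how the two preds tensors could be compared directly (the file names below are placeholders; the assumption is that each preds is saved with torch.save right before it is handed to its metric):

import torch

preds_tm = torch.load('preds_torchmetrics.pt')   # preds captured before the torchmetrics call
preds_nb = torch.load('preds_neurobench.pt')     # preds captured before the neurobench call

print(preds_tm.shape, preds_nb.shape)
print(torch.equal(preds_tm, preds_nb))           # exact element-wise equality
print(torch.allclose(preds_tm, preds_nb))        # equality up to floating-point tolerance
print((preds_tm - preds_nb).abs().max())         # largest element-wise difference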
Ah, you're right! Some changes seem to be made to the model by running an inference before running the benchmark. This keeps happening even though I have used torch.no_grad(), model.eval(), and even saving and re-loading the model. The last and easiest thing to do is to just comment out each part and test them separately, since the r2 score results from both functions are reproducible, as seen below (I hope you don't mind the full screenshots; they are included to make things clear).
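A minimal, self-contained sketch of this kind of reproducibility check, using a stand-in linear model instead of the real network (all names and shapes below are placeholders): a stateless model in eval mode under torch.no_grad() should return identical outputs on repeated calls, while a model that keeps internal state between calls might not.

import torch
import torch.nn as nn
from torchmetrics.functional import r2_score

torch.manual_seed(0)
model = nn.Linear(4, 2)            # stand-in for the real network
inputs = torch.randn(1024, 4)      # stand-in inputs
labels = torch.randn(1024, 2)      # stand-in targets

model.eval()                       # freeze dropout / batch-norm behaviour
with torch.no_grad():              # no autograd state is created or modified
    preds_1 = model(inputs)
    preds_2 = model(inputs)

print(torch.equal(preds_1, preds_2))   # True for a stateless model
print(r2_score(preds_1, labels, multioutput='uniform_average').item())
print(r2_score(preds_2, labels, multioutput='uniform_average').item())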
torchmetrics r2_score reproducibility testing:
neurobench r2 reproducibility testing: NOTE: yes, the neurobench r2 score result here is different from the one I posted before; this one is not affected by another inference.
torchmetrics r2_score preds:
torchmetrics r2_score result: NOTE: the result is the same as in the reproducibility testing.
neurobench r2 preds: NOTE: the preds here are the same as in the torchmetrics r2_score.
neurobench r2 result: NOTE: the result is the same as in the reproducibility testing, but different from the torchmetrics r2_score.
The NeuroBench and TorchMetrics r2 score calculation results are still different even though the preds passed into them are the same. Is there anything wrong with my implementation? Thank you for your help!
I have tried to look into the neurobench r2 score calculation source code. From a quick glance, the implementation is definitely different. However, I cannot confirm whether the difference between neurobench and torchmetrics comes from the mathematics, from the functions used, or from the order of the calculations. Here is the code for torchmetrics r2_score; any help in comparing the two would be much appreciated!
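One way to narrow down where the two implementations diverge is to ask torchmetrics for per-output scores and compare them against a hand-written per-column R2. A minimal sketch, assuming the preds and labels tensors have been saved to disk as in the reproduction script below:

import torch
from torchmetrics.functional import r2_score

preds = torch.load('preds.pt')
labels = torch.load('labels.pt')

# per-output scores from torchmetrics (note the argument order: preds first, then target)
print(r2_score(preds, labels, multioutput='raw_values'))

# hand-written per-column R2: 1 - SS_res / SS_tot, computed separately for each output
ss_res = ((labels - preds) ** 2).sum(dim=0)
ss_tot = ((labels - labels.mean(dim=0)) ** 2).sum(dim=0)
print(1 - ss_res / ss_tot)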
This appears to be a bug in torchmetrics, and I do believe our calculation is correct.
I saved the preds and labels tensors from your example and verified your numbers. Then I calculated the score by hand and I find the same results as the NeuroBench metric.
The bug appears to be that, for some reason, torchmetrics only calculates the score in the x dim (0th column), while the y dim (1st column) score comes out as zero. Please see the code below and attempt to reproduce.
from neurobench.models import SNNTorchModel
from neurobench.benchmarks.workload_metrics import r2
import torch.nn as nn
import torch
from torchmetrics.functional import r2_score
'''
preds:
tensor([[-0.0121, -0.0484],
[-0.0223, -0.0170],
[ 0.0132, -0.0375],
...,
[-0.0328, -0.0567],
[-0.0332, -0.0492],
[-0.0368, -0.0483]])
(Pdb) preds.shape
torch.Size([1024, 2])
(Pdb) preds.sum()
tensor(-83.1965)
(Pdb) preds.mean()
tensor(-0.0406)
'''
preds = torch.load('preds.pt')
'''
labels:
tensor([[ 1.9836e-04, -2.0599e-04],
[ 1.8311e-04, -2.0218e-04],
[ 1.5068e-04, -1.8692e-04],
...,
[-1.4877e-04, 2.6703e-05],
[-5.7220e-05, 0.0000e+00],
[ 1.1444e-05, -1.9073e-05]])
(Pdb) labels.shape
torch.Size([1024, 2])
(Pdb) labels.sum()
tensor(-0.0336)
(Pdb) labels.mean()
tensor(-1.6404e-05)
'''
labels = torch.load('labels.pt')
'''
Output from torchmetrics, raw_values:
tensor([-9623.3994, 0.0000]) ????
'''
torch_metric = r2_score(preds, labels, multioutput='uniform_average')
print("Torch metric: ", torch_metric.item()) # -4811.6997
data = (torch.zeros(1), labels)
dummy_net = nn.Module()
model = SNNTorchModel(dummy_net)
R2 = r2()
neurobench_metric = R2(model, preds, data)
print("NeuroBench metric: ", neurobench_metric) # -76350.5390625
'''
r2_score = 1 - sum((labels - preds)^2) / sum((labels - mean(labels))^2)
'''
x_preds = preds[:, 0]
y_preds = preds[:, 1]
x_labels = labels[:, 0]
y_labels = labels[:, 1]
x_mean = x_labels.mean()
y_mean = y_labels.mean()
x_num = ((x_labels - x_preds)**2).sum()
x_den = ((x_labels - x_mean)**2).sum()
y_num = ((y_labels - y_preds)**2).sum()
y_den = ((y_labels - y_mean)**2).sum()
r2_x = 1 - x_num / x_den # tensor(-9623.3994)
r2_y = 1 - y_num / y_den # tensor(-143077.6562)
r2_manual = (r2_x + r2_y) / 2
print("Manual metric: ", r2_manual.item()) # -76350.5312
@morenzoe Closing this issue, feel free to re-open if you have any more comments.
Hi, I would like to ask about the benchmarks.workload_metrics.r2 calculation. Is the formula used different from TorchMetrics's R2Score? I got different results from those two functions.
This is my code:
These are the results I got:
Is this intended? Is there anything wrong with how I calculate R2 using TorchMetrics? Thank you in advance!
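The code and result screenshots are not reproduced above. As a rough, hypothetical sketch of the kind of comparison being described (the random tensors and the dummy model wrapper are placeholders mirroring the reproduction script earlier in the thread, not the original code):

import torch
import torch.nn as nn
from torchmetrics.functional import r2_score
from neurobench.models import SNNTorchModel
from neurobench.benchmarks.workload_metrics import r2

# placeholder tensors standing in for the real model outputs and targets
preds = torch.randn(1024, 2)
labels = torch.randn(1024, 2)

# TorchMetrics side (functional form, as used in the reproduction script above)
print("torchmetrics:", r2_score(preds, labels, multioutput='uniform_average').item())

# NeuroBench side, mirroring the reproduction script above
data = (torch.zeros(1), labels)
model = SNNTorchModel(nn.Module())   # dummy wrapper; the metric itself only uses preds and data
R2 = r2()
print("neurobench:", R2(model, preds, data))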