Lightning-AI / torchmetrics

Torchmetrics - Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0

ConfusionMatrix does not work on GPU #275

Closed seanytak closed 3 years ago

seanytak commented 3 years ago

🐛 Bug

Hello,

When trying to utilize torchmetrics.IoU with preds and targets tensors on the GPU, I receive the following error

Traceback (most recent call last):
  File "/home/setakafu/.pyenv/versions/3.8.6/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/setakafu/.pyenv/versions/3.8.6/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/setakafu/projects/CSE/mlops/pipelines/train/steps/train.py", line 159, in <module>
    metric_iou(preds, targets)
  File "/home/setakafu/projects/CSE/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/setakafu/projects/CSE/venv/lib/python3.8/site-packages/torchmetrics/metric.py", line 168, in forward
    self.update(*args, **kwargs)
  File "/home/setakafu/projects/CSE/venv/lib/python3.8/site-packages/torchmetrics/metric.py", line 216, in wrapped_func
    return update(*args, **kwargs)
  File "/home/setakafu/projects/CSE/venv/lib/python3.8/site-packages/torchmetrics/classification/confusion_matrix.py", line 143, in update
    self.confmat += confmat
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

To Reproduce

Modified from the example to have pred and target on the GPU

Code Sample

import torch
from torchmetrics import IoU

target = torch.randint(0, 2, (10, 25, 25))
pred = torch.tensor(target)
pred[2:5, 7:13, 9:15] = 1 - pred[2:5, 7:13, 9:15]
iou = IoU(num_classes=2)
iou(pred.to(torch.device("cuda")), target.to(torch.device("cuda")))

Expected behavior

Metric should compute as normal

Environment

Additional context

The problem appears to be that the default state of confmat in the ConfusionMatrix class is always created on the CPU.

Happy to submit a PR for the change, but I am not sure what would best fit the API signature.
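For reference, this CPU default follows from plain tensor construction: a state tensor allocated without an explicit device argument lands on the CPU. A minimal illustration in plain torch (not the torchmetrics internals themselves):

```python
import torch

# A metric state allocated like this, with no explicit device, lives on the
# CPU, so accumulating a CUDA tensor into it raises the same device-mismatch
# RuntimeError shown in the traceback above.
confmat_state = torch.zeros(2, 2, dtype=torch.long)
print(confmat_state.device.type)  # cpu
```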

github-actions[bot] commented 3 years ago

Hi! Thanks for your contribution, great first issue!

Borda commented 3 years ago

seems like the data are not synced correctly...

maximsch2 commented 3 years ago

I think you should move the metric itself to cuda as well if you want to feed it data on GPU.

edgarriba commented 3 years ago

As @maximsch2 suggested, you need to move the module to the same device as the input/target data. Internally, the confusion matrix state lives on the CPU by default, which is why torch reports a mismatch between devices. This is the above code with the proper usage:

from torchmetrics import IoU
import torch

target = torch.randint(0, 2, (10, 25, 25))
pred = torch.tensor(target)
pred[2:5, 7:13, 9:15] = 1 - pred[2:5, 7:13, 9:15]
iou = IoU(num_classes=2).to("cuda")
iou(pred.to(torch.device("cuda")), target.to(torch.device("cuda")))

edgarriba commented 3 years ago

Closing the issue since it's not a bug, but an intended behavior.

Another possible solution would be to move the internal confusion matrix to the same device as the inputs. In that case, this should be discussed in a separate issue.
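For discussion, that alternative could be sketched roughly like this: a hypothetical toy metric (not the torchmetrics implementation) that moves its accumulated state to the inputs' device on each update:

```python
import torch

class ToyConfusionMatrix(torch.nn.Module):
    """Toy sketch of a confusion-matrix metric whose state follows the
    device of its inputs. Illustrative only, not the torchmetrics class."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.register_buffer(
            "confmat", torch.zeros(num_classes, num_classes, dtype=torch.long)
        )

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # Move the accumulated state to the inputs' device before adding,
        # avoiding the cuda:0 vs cpu RuntimeError from the traceback above.
        if self.confmat.device != preds.device:
            self.confmat = self.confmat.to(preds.device)
        n = self.confmat.shape[0]
        # Count (target, pred) pairs: rows index targets, columns predictions.
        idx = target.flatten() * n + preds.flatten()
        self.confmat += torch.bincount(idx, minlength=n * n).reshape(n, n)

preds = torch.tensor([0, 1, 1, 0])
target = torch.tensor([0, 1, 0, 0])
metric = ToyConfusionMatrix(num_classes=2)
metric.update(preds, target)
print(metric.confmat.tolist())  # [[2, 1], [0, 1]]
```

The same update works unchanged if preds and target are CUDA tensors, since the buffer is relocated lazily on first use.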