Lightning-AI / torchmetrics

Torchmetrics - Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0

ConfusionMatrix does not work on GPU #275

Closed seanytak closed 3 years ago

seanytak commented 3 years ago

🐛 Bug

Hello,

When trying to utilize torchmetrics.IoU with preds and targets tensors on the GPU, I receive the following error

Traceback (most recent call last):
  File "/home/setakafu/.pyenv/versions/3.8.6/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/setakafu/.pyenv/versions/3.8.6/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/setakafu/projects/CSE/mlops/pipelines/train/steps/train.py", line 159, in <module>
    metric_iou(preds, targets)
  File "/home/setakafu/projects/CSE/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/setakafu/projects/CSE/venv/lib/python3.8/site-packages/torchmetrics/metric.py", line 168, in forward
    self.update(*args, **kwargs)
  File "/home/setakafu/projects/CSE/venv/lib/python3.8/site-packages/torchmetrics/metric.py", line 216, in wrapped_func
    return update(*args, **kwargs)
  File "/home/setakafu/projects/CSE/venv/lib/python3.8/site-packages/torchmetrics/classification/confusion_matrix.py", line 143, in update
    self.confmat += confmat
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

To Reproduce

Modified from the example to have pred and target on the GPU

Code Sample

import torch
from torchmetrics import IoU

target = torch.randint(0, 2, (10, 25, 25))
pred = torch.tensor(target)
pred[2:5, 7:13, 9:15] = 1 - pred[2:5, 7:13, 9:15]
iou = IoU(num_classes=2)
iou(pred.to(torch.device("cuda")), target.to(torch.device("cuda")))

Expected behavior

Metric should compute as normal

Environment

Additional context

The problem appears to be that the default state of confmat in the ConfusionMatrix class is always created on the CPU.

Happy to submit a PR for the change, but I am not sure what would best fit the API signature.
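For reference, this CPU default follows from plain tensor construction: a state tensor allocated without an explicit device argument lands on the CPU. A minimal illustration in plain torch (not the torchmetrics internals themselves):

```python
import torch

# A metric state allocated like this, with no explicit device, lives on the
# CPU, so accumulating a CUDA tensor into it raises the same device-mismatch
# RuntimeError shown in the traceback above.
confmat_state = torch.zeros(2, 2, dtype=torch.long)
print(confmat_state.device.type)  # cpu
```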

github-actions[bot] commented 3 years ago

Hi! Thanks for your contribution, great first issue!

Borda commented 3 years ago

seems like the data are not synced correctly...

maximsch2 commented 3 years ago

I think you should move the metric itself to cuda as well if you want to feed it data on GPU.

edgarriba commented 3 years ago

As @maximsch2 suggested, you need to move the module to the same device as the input/target data. Internally, the confusion matrix state lives on the CPU by default, which is why torch reports a mismatch between devices. This is the above code with the proper usage:

from torchmetrics import IoU
import torch

target = torch.randint(0, 2, (10, 25, 25))
pred = torch.tensor(target)
pred[2:5, 7:13, 9:15] = 1 - pred[2:5, 7:13, 9:15]
iou = IoU(num_classes=2).to("cuda")
iou(pred.to(torch.device("cuda")), target.to(torch.device("cuda")))

edgarriba commented 3 years ago

Closing the issue since it's not a bug, but an intended behavior.

Another possible solution would be to move the internal confusion matrix to the same device as the inputs. In that case, this should be discussed in a separate issue.
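For discussion, that alternative could be sketched roughly like this: a hypothetical toy metric (not the torchmetrics implementation) that moves its accumulated state to the inputs' device on each update:

```python
import torch

class ToyConfusionMatrix(torch.nn.Module):
    """Toy sketch of a confusion-matrix metric whose state follows the
    device of its inputs. Illustrative only, not the torchmetrics class."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.register_buffer(
            "confmat", torch.zeros(num_classes, num_classes, dtype=torch.long)
        )

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # Move the accumulated state to the inputs' device before adding,
        # avoiding the cuda:0 vs cpu RuntimeError from the traceback above.
        if self.confmat.device != preds.device:
            self.confmat = self.confmat.to(preds.device)
        n = self.confmat.shape[0]
        # Count (target, pred) pairs: rows index targets, columns predictions.
        idx = target.flatten() * n + preds.flatten()
        self.confmat += torch.bincount(idx, minlength=n * n).reshape(n, n)

preds = torch.tensor([0, 1, 1, 0])
target = torch.tensor([0, 1, 0, 0])
metric = ToyConfusionMatrix(num_classes=2)
metric.update(preds, target)
print(metric.confmat.tolist())  # [[2, 1], [0, 1]]
```

The same update works unchanged if preds and target are CUDA tensors, since the buffer is relocated lazily on first use.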