Lightning-AI / torchmetrics

Torchmetrics - Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0

MeanAveragePrecision (segm): 'tuple' object has no attribute 'cpu' #1239

Closed AleRiccardi closed 1 year ago

AleRiccardi commented 1 year ago

The problem arises when we instantiate the MeanAveragePrecision class with iou_type "segm", call the update(...) method, and finally the cpu() method. I personally have no reason to call the cpu() method, but PyTorch Lightning does: at the end of training it tries to place every inner module on the CPU (you can find the full traceback of the error at the bottom of this issue, which proves what I am saying). This triggers the self._apply(...) method in this package's metric.py file, which raises the following error: AttributeError: 'tuple' object has no attribute 'cpu'.
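
A minimal reproduction (a sketch, not the original training setup; calling cpu() on the metric directly triggers the same _apply path as Lightning's teardown):

    import torch
    from torchmetrics.detection.mean_ap import MeanAveragePrecision

    metric = MeanAveragePrecision(iou_type="segm")

    # Toy prediction/target with binary instance masks of shape (num_instances, H, W);
    # the exact mask dtype the validator expects may differ between versions.
    preds = [{
        "masks": torch.randint(0, 2, (1, 10, 10), dtype=torch.uint8),
        "scores": torch.tensor([0.9]),
        "labels": torch.tensor([0]),
    }]
    target = [{
        "masks": torch.randint(0, 2, (1, 10, 10), dtype=torch.uint8),
        "labels": torch.tensor([0]),
    }]

    metric.update(preds, target)
    metric.cpu()  # AttributeError: 'tuple' object has no attribute 'cpu'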

The reason this happens is that every time we update the metric, the following method is called:

Source

    def _get_safe_item_values(self, item: Dict[str, Any]) -> Union[Tensor, Tuple]:

        if self.iou_type == "bbox":
            boxes = _fix_empty_tensors(item["boxes"])
            if boxes.numel() > 0:
                boxes = box_convert(boxes, in_fmt=self.box_format, out_fmt="xyxy")
            return boxes
        elif self.iou_type == "segm":
            masks = []

            for i in item["masks"].cpu().numpy():
                rle = mask_utils.encode(np.asfortranarray(i))
                masks.append((tuple(rle["size"]), rle["counts"]))

            return tuple(masks)
        else:
            raise Exception(f"IOU type {self.iou_type} is not supported")

This changes the masks' type from Tensor to Tuple, and the result is then appended to the self.detections list in the code below:

Source

        for item in preds:

            detections = self._get_safe_item_values(item)

            self.detections.append(detections)
            self.detection_labels.append(item["labels"])
            self.detection_scores.append(item["scores"])

Finally, when the _apply(...) method is called, it tries to move every element of the self.detections list to the CPU device. But because every element is a tuple, it raises the error above: a plain tuple does not implement a cpu() method.

Source

            current_val = getattr(this, key)
            if isinstance(current_val, Tensor):
                setattr(this, key, fn(current_val))
            elif isinstance(current_val, Sequence):
                setattr(this, key, [fn(cur_v) for cur_v in current_val])

The fn(...) argument is the lambda defined here:

Source

        return self._apply(lambda t: t.cpu())
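
The failure is easy to reproduce in isolation (a hypothetical standalone example, not library code):

    detections = [((10, 10), b"counts"), ((10, 10), b"counts")]  # tuples, as stored for iou_type="segm"
    [cur_v.cpu() for cur_v in detections]  # AttributeError: 'tuple' object has no attribute 'cpu'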

Full error traceback:

  File "/home/alessandror/Projects/ml-package/examples/train.py", line 16, in main                                                                                                            
    trainer.fit(model, data, ckpt_path=ckpt_path)                                                                                                                                             
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1202, in _run
    self._post_dispatch()
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1267, in _post_dispatch
    self.accelerator.teardown()
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/pytorch_lightning/accelerators/gpu.py", line 79, in teardown
    super().teardown()
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 190, in teardown
    self.training_type_plugin.teardown()
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/single_device.py", line 86, in teardown
    self.lightning_module.cpu()
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 137, in cpu
    return super().cpu()
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/torch/nn/modules/module.py", line 711, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/torchmetrics/metric.py", line 673, in _apply
    setattr(this, key, [fn(cur_v) for cur_v in current_val])
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/torchmetrics/metric.py", line 673, in <listcomp>
    setattr(this, key, [fn(cur_v) for cur_v in current_val])
  File "/home/alessandror/.miniconda3/envs/ml-package/lib/python3.9/site-packages/torch/nn/modules/module.py", line 711, in <lambda>
    return self._apply(lambda t: t.cpu())
AttributeError: 'tuple' object has no attribute 'cpu'

SkafteNicki commented 1 year ago

Hi @AleRiccardi, thanks for raising this issue. After taking a look at it (thanks for all the info you have provided), it is a tricky one. It stems from the limitation that metric states in torchmetrics can by default only be tensors or lists of tensors. However, for this metric with `iou_type="segm"` we actually need a list of tuples of tensors (so one extra layer of nesting).
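
For context, metric states are declared through Metric.add_state, which only accepts a Tensor or an empty list as the default value. A simplified sketch of the two allowed kinds of state:

    import torch
    from torchmetrics import Metric

    class MyMetric(Metric):
        def __init__(self):
            super().__init__()
            # Allowed defaults: a Tensor, or an empty list (a "list state" holding tensors).
            self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")
            self.add_state("items", default=[], dist_reduce_fx="cat")

        def update(self, x: torch.Tensor) -> None:
            self.total += x.sum()
            self.items.append(x)

        def compute(self) -> torch.Tensor:
            return self.total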

I can think of two solutions:

  1. We could refactor metric states to allow more types and more deeply nested structures. In principle this is not that hard, but it would still mean substantial changes to the codebase.
  2. We could solve this issue by implementing a TensorTuple class that implements all the common tensor-moving methods like .cpu(), .cuda(), etc.; I added an example of what that could look like below. We would then replace the call to tuple(masks) with TensorTuple(masks), and everything should somewhat work (I think something would still need to change for DDP to work).
    TensorTuple.py
from typing import Callable, Optional, TypeVar, Union

import torch
from torch import device, dtype

T = TypeVar('T', bound='TensorTuple')

class TensorTuple(tuple):
    """A tuple of tensors that supports the device/dtype-moving API of torch.nn.Module."""

    def _apply(self: T, fn: Callable) -> T:
        # Apply fn to every element and return a new TensorTuple.
        return TensorTuple(fn(val) for val in self)

    def cuda(self: T, device: Optional[Union[int, device]] = None) -> T:
        return self._apply(lambda t: t.cuda(device))

    def ipu(self: T, device: Optional[Union[int, device]] = None) -> T:
        return self._apply(lambda t: t.ipu(device))

    def xpu(self: T, device: Optional[Union[int, device]] = None) -> T:
        return self._apply(lambda t: t.xpu(device))

    def cpu(self: T) -> T:
        return self._apply(lambda t: t.cpu())

    def type(self: T, dst_type: Union[dtype, str]) -> T:
        return self._apply(lambda t: t.type(dst_type))

    def float(self: T) -> T:
        return self._apply(lambda t: t.float() if t.is_floating_point() else t)

    def double(self: T) -> T:
        return self._apply(lambda t: t.double() if t.is_floating_point() else t)

    def half(self: T) -> T:
        return self._apply(lambda t: t.half() if t.is_floating_point() else t)

    def bfloat16(self: T) -> T:
        return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)

    def to_empty(self: T, *, device: Union[str, device]) -> T:
        return self._apply(lambda t: torch.empty_like(t, device=device))
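
For illustration, a TensorTuple holding tensors would then be movable like a module (hypothetical usage):

    masks = TensorTuple([torch.zeros(2, 2), torch.ones(3)])
    masks = masks.cpu()  # returns a new TensorTuple instead of raising AttributeError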

@justusschock what do you think we should do?

justusschock commented 1 year ago

@SkafteNicki I would not use option 2, as it is easy to break something we are not aware of that way.

I suggest introducing https://github.com/Lightning-AI/utilities as a dependency and relying on https://github.com/Lightning-AI/utilities/blob/main/src/lightning_utilities/core/apply_func.py for this case (similar to what PL does in several places). This way you can nest as deeply as you wish and then use apply_to_collection with dtype=torch.Tensor to map a function over the tensors at all levels of the nested collections.
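
A minimal sketch of that approach (assuming lightning_utilities is installed):

    import torch
    from torch import Tensor
    from lightning_utilities.core.apply_func import apply_to_collection

    # Arbitrarily nested collections are traversed; only Tensor leaves are mapped,
    # and the surrounding structure (here a list of tuples) is preserved.
    states = [(torch.ones(2), torch.zeros(3)), (torch.ones(1),)]
    states_cpu = apply_to_collection(states, dtype=Tensor, function=lambda t: t.cpu())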

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tomliumd commented 6 months ago

It would be great to have a solution here.