Replace dictionaries by defined classes or named tuples

schwobr commented 3 years ago

I have been thinking about ways to make the code more easily understandable and have less compatibility issues within it. One thing that struck me as quite impractical was the fact that interval or offset should always be defined as a dictctionary with set keys such as {"x": 0, "y": 0}. The same goes for the patch dictionaries, they should always have the same keys. That's why I think we should switch to named tuples for coordinates and to a dataclass for patches. The idea is to define something like:

class Coord(NamedTuple):
    x: int
    y: int

@dataclass(frozen=True)
class Patch:
    id: str
    x: int
    y: int
    level: int
    dx: int
    dy: int
    parent: str

This change comes with pros and cons. Pros:

Allows for clear type hints. A coordinate should be a tuple of integers, a patch should be an object which has an id, coordinates, etc.
As humans, we always think of coordinates as tuples. When we ask for an interval couple or an offset, I don't think there is a particular reason for using dictionaries instead.
Named tuples are tuples, which means we can always suppose something like interval is a basic tuple (which avoid complicating user inputs) but have our api return Coord objects.
Tuples can have their values unpacked, which means we can do things like x, y = interval. With dicts this would assign the keys and not the values, which leads to x = 'x' and y = 'y'.
We can make it backward compatible and add a warning for scripts that we coded using the dictionary definition. It just needs if isinstance(interval, dict): interval = Coord(**interval).
Frozen dataclasses and tuple are immutable, which ensures that error are raised if we try doing using other attributes for instance. That can be practical for debugging.
When it comes down to writing csvs, it allows us to make changes in what we want to include in a Patch quite easily without having to look for all places where the field names are mentioned. We just use fieldnames=list(Patch.__annotations__) and writer.writerow(vars(patch)) where patch is an instance of Patch.
We can implement methods, such as get_children for hierarchical extraction.
We could also imagine having parent be a Patch instance. It complicates csv writing a bit, but we can easily work around it.

Cons:

Source code becomes more complex. I still think defining a patch by patch = Patch(id=id, x=x, y=y, level=level, dx=dx, dy=dy, parent=parent) is as transparent as before though.
It is not a required change per say.
Some tests need to be rewritten.
Well it actually implies some heavy refactoring, but I'm on it.

pilarOrtega commented 3 years ago

I agree, apparently the only reason why not to is the rewriting :) so go ahead, it's a good idea!

schwobr commented 3 years ago

I'd also like to use this occasion to add some things that I feel are missing in the patch csv:

The real size of the patch (dx and dy represents its size at level 0). Either by adding the downsample ratio (which is not always 2**level, for instance in svs files), or by having both the level 0 size and the actual size.
I think it could be good to have the slidename in it. It is not absolutely necessary as it is the name of the csv, but if we want to fuse all csvs in one for some reason, having the slidename in it will make the process very straightforward. I would do so by adding a global id that looks like f"{slidename}{id}". That way each patch of itch slide has a unique id for the whole dataset, that could come handy for some applications and doesn't require much change.

schwobr commented 3 years ago

Also I think we should have psize be a tuple as well, and have Coord optionally accept a single int as input, in which case both coordinates are the same.

schwobr commented 3 years ago

For now it looks like:

@dataclass(frozen=True)
class Patch:
    id: str
    slidename: str
    position: Coord
    level: int
    size: Coord
    size_0: Coord
    parent: Optional[Patch] = None

    @classmethod
    def get_fields(cls) -> List[str]:
        return [
            "id",
            "global_id",
            "x",
            "y",
            "level",
            "dx",
            "dy",
            "size_x",
            "size_y",
            "parent",
        ]

    def to_csv_row(self) -> Dict[str, Union[str, int]]:
        return {
            "id": self.id,
            "global_id": self.slidename+self.id,
            "x": self.position[0],
            "y": self.position[1],
            "level": self.level,
            "dx": self.size_0[0],
            "dy": self.size_0[1],
            "size_x": self.size[0],
            "size_y": self.size[1],
            "parent": "None" if self.parent is None else self.parent.id
        }

ArnaudAbreu commented 3 years ago

Being a very simple data structure, but being instantiated many times, wouldn't it be smart to use __slots__ in this class? Actually I don't know if typehints and recent versions of python still need it, but I found the concept interesting. You can have a look at this: https://stackoverflow.com/questions/1336791/dictionary-vs-object-which-is-more-efficient-and-why Tell me if you test it, I'm curious.

schwobr commented 3 years ago

We could probably use them but I'm not sure it is necessary. We gain maximum 10% performance and it looks to me as it is not compatible with dataclass as it needs a __dict__ to be created. We could just not use dataclass then but it is quite practical, especially because it defines __eq__ automatically.

I might still try it and check if it is significantally faster, in which case it could be worth it.

MicroMedIAn / PathAIA

Replace dictionaries by defined classes or named tuples #22