MicroMedIAn / PathAIA

Digital Pathology Analysis Tools
GNU General Public License v3.0
3 stars 3 forks source link

Replace dictionaries by defined classes or named tuples #22

Closed schwobr closed 3 years ago

schwobr commented 3 years ago

I have been thinking about ways to make the code more easily understandable and have less compatibility issues within it. One thing that struck me as quite impractical was the fact that interval or offset should always be defined as a dictctionary with set keys such as {"x": 0, "y": 0}. The same goes for the patch dictionaries, they should always have the same keys. That's why I think we should switch to named tuples for coordinates and to a dataclass for patches. The idea is to define something like:

class Coord(NamedTuple):
    x: int
    y: int

@dataclass(frozen=True)
class Patch:
    id: str
    x: int
    y: int
    level: int
    dx: int
    dy: int
    parent: str

This change comes with pros and cons. Pros:

Cons:

pilarOrtega commented 3 years ago

I agree, apparently the only reason why not to is the rewriting :) so go ahead, it's a good idea!

schwobr commented 3 years ago

I'd also like to use this occasion to add some things that I feel are missing in the patch csv:

schwobr commented 3 years ago

Also I think we should have psize be a tuple as well, and have Coord optionally accept a single int as input, in which case both coordinates are the same.

schwobr commented 3 years ago

For now it looks like:

@dataclass(frozen=True)
class Patch:
    id: str
    slidename: str
    position: Coord
    level: int
    size: Coord
    size_0: Coord
    parent: Optional[Patch] = None

    @classmethod
    def get_fields(cls) -> List[str]:
        return [
            "id",
            "global_id",
            "x",
            "y",
            "level",
            "dx",
            "dy",
            "size_x",
            "size_y",
            "parent",
        ]

    def to_csv_row(self) -> Dict[str, Union[str, int]]:
        return {
            "id": self.id,
            "global_id": self.slidename+self.id,
            "x": self.position[0],
            "y": self.position[1],
            "level": self.level,
            "dx": self.size_0[0],
            "dy": self.size_0[1],
            "size_x": self.size[0],
            "size_y": self.size[1],
            "parent": "None" if self.parent is None else self.parent.id
        }
ArnaudAbreu commented 3 years ago

Being a very simple data structure, but being instantiated many times, wouldn't it be smart to use __slots__ in this class? Actually I don't know if typehints and recent versions of python still need it, but I found the concept interesting. You can have a look at this: https://stackoverflow.com/questions/1336791/dictionary-vs-object-which-is-more-efficient-and-why Tell me if you test it, I'm curious.

schwobr commented 3 years ago

We could probably use them but I'm not sure it is necessary. We gain maximum 10% performance and it looks to me as it is not compatible with dataclass as it needs a __dict__ to be created. We could just not use dataclass then but it is quite practical, especially because it defines __eq__ automatically.

I might still try it and check if it is significantally faster, in which case it could be worth it.