Closed schwobr closed 3 years ago
I agree, apparently the only reason why not to is the rewriting :) so go ahead, it's a good idea!
I'd also like to use this occasion to add some things that I feel are missing in the patch csv:
dx
and dy
represents its size at level 0). Either by adding the downsample ratio (which is not always 2**level
, for instance in svs files), or by having both the level 0 size and the actual size.f"{slidename}{id}"
. That way each patch of itch slide has a unique id for the whole dataset, that could come handy for some applications and doesn't require much change.Also I think we should have psize
be a tuple as well, and have Coord
optionally accept a single int as input, in which case both coordinates are the same.
For now it looks like:
@dataclass(frozen=True)
class Patch:
id: str
slidename: str
position: Coord
level: int
size: Coord
size_0: Coord
parent: Optional[Patch] = None
@classmethod
def get_fields(cls) -> List[str]:
return [
"id",
"global_id",
"x",
"y",
"level",
"dx",
"dy",
"size_x",
"size_y",
"parent",
]
def to_csv_row(self) -> Dict[str, Union[str, int]]:
return {
"id": self.id,
"global_id": self.slidename+self.id,
"x": self.position[0],
"y": self.position[1],
"level": self.level,
"dx": self.size_0[0],
"dy": self.size_0[1],
"size_x": self.size[0],
"size_y": self.size[1],
"parent": "None" if self.parent is None else self.parent.id
}
Being a very simple data structure, but being instantiated many times, wouldn't it be smart to use __slots__
in this class?
Actually I don't know if typehints and recent versions of python still need it, but I found the concept interesting.
You can have a look at this: https://stackoverflow.com/questions/1336791/dictionary-vs-object-which-is-more-efficient-and-why
Tell me if you test it, I'm curious.
We could probably use them but I'm not sure it is necessary. We gain maximum 10% performance and it looks to me as it is not compatible with dataclass
as it needs a __dict__
to be created. We could just not use dataclass
then but it is quite practical, especially because it defines __eq__
automatically.
I might still try it and check if it is significantally faster, in which case it could be worth it.
I have been thinking about ways to make the code more easily understandable and have less compatibility issues within it. One thing that struck me as quite impractical was the fact that
interval
oroffset
should always be defined as a dictctionary with set keys such as{"x": 0, "y": 0}
. The same goes for the patch dictionaries, they should always have the same keys. That's why I think we should switch to named tuples for coordinates and to a dataclass for patches. The idea is to define something like:This change comes with pros and cons. Pros:
interval
couple or anoffset
, I don't think there is a particular reason for using dictionaries instead.interval
is a basic tuple (which avoid complicating user inputs) but have our api returnCoord
objects.x, y = interval
. With dicts this would assign the keys and not the values, which leads tox = 'x' and y = 'y'
.if isinstance(interval, dict): interval = Coord(**interval)
.Patch
quite easily without having to look for all places where the field names are mentioned. We just usefieldnames=list(Patch.__annotations__)
andwriter.writerow(vars(patch))
wherepatch
is an instance ofPatch
.get_children
for hierarchical extraction.parent
be aPatch
instance. It complicates csv writing a bit, but we can easily work around it.Cons:
patch = Patch(id=id, x=x, y=y, level=level, dx=dx, dy=dy, parent=parent)
is as transparent as before though.