Closed shaunc closed 3 years ago
The last commit is a cleaner interface IMO -- requires marker class to be passed in in order to use this functionality. Passing in "true" was really a leftover from trying to create the marker class dynamically.
I'd bet that the typing could be cleaned up somehow, to get rid of some of the casts, but not sure how.
... ok ... a stab at cleaning up types. :) make_marked_subclass
could probably use some improvement too.
@shaunc Thank you for proposing a new feature! The idea of custom Dataset base class look good to me and should be implemented. However, from the following points of view, would I propose another way of implementation?
I originally wrote DatasetMixin
only for providing shorthands, not for customizing a Dataset class (in other words, I wanted to minimize the changes from dataclasses
by xarray-dataclasses
...). I think that SomeClass.new(*args, **kwargs)
must be equivalent to asdataset(SomeClass(*args, **kwargs))
. Your implementation looks ok to work but seems to break the equivalence becauseasdataclass
needs an extra as_class
argument to be equivalent to new
.
So, rather than customizing a class by mix-in, I would like to propose a dedicated type hint to specify base class:
from typing import ClassVar, Type, TypeVar
from typing_extensions import Annotated
T = TypeVar("T")
# type hint to specify DataArray or Dataset bases
Base = ClassVar[Annotated[Type[T], "Base"]]
Then your test codes would become:
class ImageMaskedDataset(xr.Dataset):
__slots__ = tuple()
@datasetclass
class ImageMasked:
data: Data[Tuple[X, Y], float]
mask: Data[Tuple[X, Y], bool]
base: Base[xr.Dataset] = ImageMaskedDataset
The advantages of this implementation are (1) information of a base class is contained in a dataclass, (2) the information is ensured to be inherited by subclasses, and (3) the feature can also be easily used for custom DataArray base classes.
In a real implementation, we would need:
Base
in xarray_dataclasses.typing
("Base"
should be replaced by a new enum like Xarray.BASE
)is_base()
in xarray_dataclasses.typing
get_base()
in xarray_dataclasses.common
asdataset()
so that it uses get_base()
inside itAnyway, this is my rough proposal and I may miss something important in your implementation that I did not mention. Please feel free to comment on that and I am happy to continue to discuss!
@astropenguin I like your idea. I'll think about it in more detail over the weekend, and see if I can implement as you suggest, or if there might be additional characteristics I've left implicit. Thanks!
I like the form of the idea, but don't actually like the word "Base". Actually the class specified is the leaf of the xr.Dataset
hierarchy, with xr.Dataset
itself being the (or a) base. How about "DatasetClass"? or just "Dataset" (or even just "Class")?
I presume that in base: Base[xr.Dataset] = ImageMaskedDataset
, the identifier base
is arbitrary. What matters is that there is some (and no more than one) class member whose type is DatasetClass
(or whatever it is called). If we do have a convention (at least for examples), what about __dataset_class__
? (Not polluting the namespace of the dataclass was why I originally didn't implement with a class member.)
Do we need to expose the annotation of Base
to the user? Is this so that we can use it for xr.DataArray
as well? Perhaps the machinery can be generic, but I don't see a need to put it in the interface (?) just perhaps:
@datasetclass
class ImageMasked:
data: Data[Tuple[X, Y], float]
mask: Data[Tuple[X, Y], bool]
__dataset_class__: DatasetClass = ImageMaskedDataset
Where if we need it we might have somewhere in the implementation:
_DataClass = ClassVar[Annotated[Type[T], "_DataClass"]]
DatasetClass = _DataClass[xr.Dataset]
Hmm... one issue I have with both my implementation and your proposal is that new
isn't typed right.
What about, instead:
@datasetclass
class ImageMasked(DatasetMixin[ImageMaskedDataset]):
data: Data[Tuple[X, Y], float]
mask: Data[Tuple[X, Y], bool]
And DatasetMixin
would have a new
that returned that type? @datasetclass
could just omit extension when it saw that. That seems like the cleanest way I can think of right now.
Update -- proof of concept implementation: https://github.com/astropenguin/xarray-dataclasses/pull/46#issue-608352874
@shaunc I am sorry for long silence... And thank you for proposing the new implementation.
new
isn't typed right.
The generic mixin class works fine, but I think that a generic protocol class exclusively for dataset is simpler in this case. For example:
T = TypeVar("T", covariant=True, bound=xr.Dataset)
class DatasetClass(Protocol[T]):
__init__: Callable[..., None]
__dataset_factory__: Callable[[xr.Dataset], T]
__dataclass_fields__ = Dict[str, Field[Any]]
def asdataset(inst: DatasetClass[T]) -> T:
dataset = inst_to_dataset(inst) # pseudo code
return inst.__dataset_factory__(dataset)
where __dataset_factory__
is a callable (either class or function is ok) which converts an instance of xr.Dataset
to that of a custom dataset class. Then the following code properly types an output dataset:
class CustomDataset(xr.Dataset):
__slots__ = ()
@dataclass
class Custom:
data: Data[tuple["x", "y"], float]
__dataset_factory__ = CustomDataset
ds = asdataset(Custom(...)) # properly typed as CustomDataset
Because asdataset
does the job, we do not need to change DatasetMixin
so much:
class DatasetMixin:
@classmethod
def new(
cls: Type[DatasetClass[T]], # the only change
*args: Any,
**kwargs: Any,
) -> T:
inst = cls(*args, **kwargs)
return asdataset(inst)
Then the following code works as well:
@dataclass
class Custom(DatasetMixin):
data: Data[tuple["x", "y"], float]
__dataset_factory__ = CustomDataset
ds = Custom.new(...) # properly typed as CustomDataset
Could you let me have time to implement in another branch to check if it really works well (I only checked it in a separate code from xarray_dataclasses
...)?
@shaunc Please let me close the issue because a similar feature was implemented by #50 and is now available in v0.4.0. Please feel free to raise any issues if you have some requests on the feature. Thank you!
The goal is that the returned xarray (optionally) can be distinguished with
isinstance
from a generic xarray.My use case is to use with
functools.singledispatch
-- I have two datasets representing different kinds of patterns, sharing 9 data arrays in common but with two variants. I would like to define variant processing in some cases using@signledispatch
. I have added a test of this functionality.Some of the type-related code is awkward. I was originally going to have a generic class as the marker "
DatasetOf[Foo]
" whereFoo
is the datasetclass, but seems that single dispatch won't branch on such.If you like the idea but want a different implementation, I'm happy to amend.