astropenguin / xarray-dataclasses

:zap: xarray data creation by data classes
https://astropenguin.github.io/xarray-dataclasses/
MIT License
71 stars 4 forks source link

Optional marker class as argument to datasetclass decorator #41

Closed shaunc closed 3 years ago

shaunc commented 3 years ago

The goal is that the returned xarray (optionally) can be distinguished with isinstance from a generic xarray.

My use case is to use with functools.singledispatch -- I have two datasets representing different kinds of patterns, sharing 9 data arrays in common but with two variants. I would like to define variant processing in some cases using @signledispatch. I have added a test of this functionality.

Some of the type-related code is awkward. I was originally going to have a generic class as the marker "DatasetOf[Foo]" where Foo is the datasetclass, but seems that single dispatch won't branch on such.

If you like the idea but want a different implementation, I'm happy to amend.

shaunc commented 3 years ago

The last commit is a cleaner interface IMO -- requires marker class to be passed in in order to use this functionality. Passing in "true" was really a leftover from trying to create the marker class dynamically.

I'd bet that the typing could be cleaned up somehow, to get rid of some of the casts, but not sure how.

shaunc commented 3 years ago

... ok ... a stab at cleaning up types. :) make_marked_subclass could probably use some improvement too.

astropenguin commented 3 years ago

@shaunc Thank you for proposing a new feature! The idea of custom Dataset base class look good to me and should be implemented. However, from the following points of view, would I propose another way of implementation?

I originally wrote DatasetMixin only for providing shorthands, not for customizing a Dataset class (in other words, I wanted to minimize the changes from dataclasses by xarray-dataclasses...). I think that SomeClass.new(*args, **kwargs) must be equivalent to asdataset(SomeClass(*args, **kwargs)). Your implementation looks ok to work but seems to break the equivalence becauseasdataclass needs an extra as_class argument to be equivalent to new.

So, rather than customizing a class by mix-in, I would like to propose a dedicated type hint to specify base class:

from typing import ClassVar, Type, TypeVar
from typing_extensions import Annotated
T = TypeVar("T")

# type hint to specify DataArray or Dataset bases
Base = ClassVar[Annotated[Type[T], "Base"]]

Then your test codes would become:

class ImageMaskedDataset(xr.Dataset):
    __slots__ = tuple()

@datasetclass
class ImageMasked:
    data: Data[Tuple[X, Y], float]
    mask: Data[Tuple[X, Y], bool]
    base: Base[xr.Dataset] = ImageMaskedDataset

The advantages of this implementation are (1) information of a base class is contained in a dataclass, (2) the information is ensured to be inherited by subclasses, and (3) the feature can also be easily used for custom DataArray base classes.

astropenguin commented 3 years ago

In a real implementation, we would need:

Anyway, this is my rough proposal and I may miss something important in your implementation that I did not mention. Please feel free to comment on that and I am happy to continue to discuss!

shaunc commented 3 years ago

@astropenguin I like your idea. I'll think about it in more detail over the weekend, and see if I can implement as you suggest, or if there might be additional characteristics I've left implicit. Thanks!

shaunc commented 3 years ago
  1. I like the form of the idea, but don't actually like the word "Base". Actually the class specified is the leaf of the xr.Dataset hierarchy, with xr.Dataset itself being the (or a) base. How about "DatasetClass"? or just "Dataset" (or even just "Class")?

  2. I presume that in base: Base[xr.Dataset] = ImageMaskedDataset, the identifier base is arbitrary. What matters is that there is some (and no more than one) class member whose type is DatasetClass (or whatever it is called). If we do have a convention (at least for examples), what about __dataset_class__? (Not polluting the namespace of the dataclass was why I originally didn't implement with a class member.)

  3. Do we need to expose the annotation of Base to the user? Is this so that we can use it for xr.DataArray as well? Perhaps the machinery can be generic, but I don't see a need to put it in the interface (?) just perhaps:

@datasetclass
class ImageMasked:
    data: Data[Tuple[X, Y], float]
    mask: Data[Tuple[X, Y], bool]
    __dataset_class__: DatasetClass = ImageMaskedDataset

Where if we need it we might have somewhere in the implementation:

_DataClass = ClassVar[Annotated[Type[T], "_DataClass"]]
DatasetClass = _DataClass[xr.Dataset]
shaunc commented 3 years ago

Hmm... one issue I have with both my implementation and your proposal is that new isn't typed right.

What about, instead:

@datasetclass
class ImageMasked(DatasetMixin[ImageMaskedDataset]):
    data: Data[Tuple[X, Y], float]
    mask: Data[Tuple[X, Y], bool]

And DatasetMixin would have a new that returned that type? @datasetclass could just omit extension when it saw that. That seems like the cleanest way I can think of right now.

Update -- proof of concept implementation: https://github.com/astropenguin/xarray-dataclasses/pull/46#issue-608352874

astropenguin commented 3 years ago

@shaunc I am sorry for long silence... And thank you for proposing the new implementation.

new isn't typed right.

The generic mixin class works fine, but I think that a generic protocol class exclusively for dataset is simpler in this case. For example:

T = TypeVar("T", covariant=True, bound=xr.Dataset)

class DatasetClass(Protocol[T]):
    __init__: Callable[..., None]
    __dataset_factory__: Callable[[xr.Dataset], T]
    __dataclass_fields__ = Dict[str, Field[Any]]

def asdataset(inst: DatasetClass[T]) -> T:
    dataset = inst_to_dataset(inst) # pseudo code
    return inst.__dataset_factory__(dataset)

where __dataset_factory__ is a callable (either class or function is ok) which converts an instance of xr.Dataset to that of a custom dataset class. Then the following code properly types an output dataset:

class CustomDataset(xr.Dataset):
    __slots__ = ()

@dataclass
class Custom:
    data: Data[tuple["x", "y"], float]
    __dataset_factory__ = CustomDataset

ds = asdataset(Custom(...)) # properly typed as CustomDataset

Because asdataset does the job, we do not need to change DatasetMixin so much:

class DatasetMixin:
    @classmethod
    def new(
        cls: Type[DatasetClass[T]], # the only change
        *args: Any,
        **kwargs: Any,
    ) -> T:
        inst = cls(*args, **kwargs)
        return asdataset(inst)

Then the following code works as well:

@dataclass
class Custom(DatasetMixin):
    data: Data[tuple["x", "y"], float]
    __dataset_factory__ = CustomDataset

ds = Custom.new(...) # properly typed as CustomDataset

Could you let me have time to implement in another branch to check if it really works well (I only checked it in a separate code from xarray_dataclasses...)?

astropenguin commented 3 years ago

@shaunc Please let me close the issue because a similar feature was implemented by #50 and is now available in v0.4.0. Please feel free to raise any issues if you have some requests on the feature. Thank you!