NicolaDonelli / py4ai-core

MIT License

union methods for Datasets should be reworked #6

Closed NicolaDonelli closed 1 year ago

NicolaDonelli commented 2 years ago

I noticed that the method py4ai.core.data.model.ml.Dataset.union takes a Dataset as input and returns a LazyDataset, while py4ai.core.data.model.ml.PandasDataset.union takes a Dataset as input and returns either a Dataset or a PandasDataset (PandasTimeIndexedDataset inherits this behavior, so it returns either a Dataset or a PandasTimeIndexedDataset).

I cannot see any advantage in letting union handle dataset types different from self (anyone who wants to unite datasets of different types can simply cast the latter to the type of the former). I also dislike that the union of two objects of the same type does not necessarily return an object of that same type. I would therefore propose these fixes:

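from abc import ABC, abstractmethod
from typing import Generic, TypeVar

import pandas as pd

# Self-type variable: union returns the same type as the object it is called on
TDataset = TypeVar('TDataset', bound='Dataset')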

class Dataset(
    _IterableUtils[SampleTypes, "CachedDataset", "LazyDataset"],
    Generic[FeatType, LabType],
    ABC,
):
    ...
    @abstractmethod
    def union(self: TDataset, other: TDataset) -> TDataset: ...

class CachedDataset(_CachedIterable[SampleTypes], DillSerialization, Dataset):
    ...
    def union(self, other: "CachedDataset") -> "CachedDataset":
        # Concatenate the in-memory items of both datasets
        return CachedDataset(list(self.items) + list(other.items))

class LazyDataset(_LazyIterable[Sample], Dataset):
    ...
    def union(self, other: "LazyDataset") -> "LazyDataset":
        # Nothing is evaluated here: samples are produced only on iteration
        def __generator__():
            yield from self
            yield from other
        return LazyDataset(IterGenerator(__generator__))

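# Assumed definition, by analogy with TDataset above
TPandasDataset = TypeVar('TPandasDataset', bound='PandasDataset')

class PandasDataset(Dataset[FeatType, LabType], DillSerialization):
    ...
    def union(self: TPandasDataset, other: TPandasDataset) -> TPandasDataset: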
        features = pd.concat([self.features, other.features])
        labels = (
            pd.concat([self.labels, other.labels])
            if not (self.labels is None and other.labels is None)
            else None
        )
        return self.createObject(features, labels)
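
To make the intended contract easier to check in isolation, here is a minimal, self-contained sketch of the same pattern. The classes ToyDataset, ToyCachedDataset and ToyLazyDataset are hypothetical stand-ins invented for this example, not the py4ai classes, and the lazy variant takes a plain zero-argument factory where the library uses IterGenerator. It only illustrates that each union returns the type of the object it is called on.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, Iterator, List, TypeVar

# Hypothetical stand-ins for illustration only; these are NOT the py4ai classes.
TToyDataset = TypeVar("TToyDataset", bound="ToyDataset")


class ToyDataset(ABC):
    """Minimal base class whose union returns the caller's own type."""

    @abstractmethod
    def __iter__(self) -> Iterator[int]: ...

    @abstractmethod
    def union(self: TToyDataset, other: TToyDataset) -> TToyDataset: ...


class ToyCachedDataset(ToyDataset):
    """Holds all samples in memory."""

    def __init__(self, items: Iterable[int]) -> None:
        self.items: List[int] = list(items)

    def __iter__(self) -> Iterator[int]:
        return iter(self.items)

    def union(self, other: "ToyCachedDataset") -> "ToyCachedDataset":
        # type(self) keeps subclasses returning their own type
        return type(self)(self.items + other.items)


class ToyLazyDataset(ToyDataset):
    """Wraps a zero-argument factory so the data can be iterated repeatedly."""

    def __init__(self, factory: Callable[[], Iterator[int]]) -> None:
        self._factory = factory

    def __iter__(self) -> Iterator[int]:
        return self._factory()

    def union(self, other: "ToyLazyDataset") -> "ToyLazyDataset":
        # Samples are only produced when the result is iterated
        def generator() -> Iterator[int]:
            yield from self
            yield from other

        return type(self)(generator)


cached = ToyCachedDataset([1, 2]).union(ToyCachedDataset([3]))
lazy = ToyLazyDataset(lambda: iter([1, 2])).union(ToyLazyDataset(lambda: iter([3])))
```

With this shape, uniting datasets of different types requires an explicit cast first, for example ToyCachedDataset(lazy).union(cached), which is the trade-off argued for above.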