Closed NicolaDonelli closed 2 years ago
I suppose we would like to be able to merge Cached and Lazy Datasets, or?
lazy_dataset.union(cached_dataset)
Should throw a type error?
In the current implementation it actually raises a TypeError, thanks to the decorator, but I am not completely sure this is what one could actually want.
I am quite sure that one would want that performing a union from a certain type of Dataset should return a Dataset of the same type, but I am not at all sure that one wouldn’t want to perform a union between different types of datasets.
What do you think about it?
Yes, I understand that this makes it to raise an error. And my point was that I would rather endorse a proposal to make this possible.
I understand that one could always to
lazy_dataset.toCached.union(cached_dataset)
but this seems a bit too verbose
Fair enough. I changed the method removing, from LazyDataset and CachedDataset union method the same_type
decorator (but I kept it for PandasDataset since in that case it is actually required by the code that the united dataset is a PandasDataset).
I'm not sure I understand why this should not apply to PandasDataset. We could really make it such that if you have
pandas_dataset + other_dataset => merged_pandas_dataset
I understand this is not commutative, e g
other_dataset + pandas_dataset does NOT give a PandasDataset
But this is sometime the case in programming, and i don't think it is a super deal breaker to me. But If it is so to you, then fine with keeping the decorator
I'm not sure I understand why this should not apply to PandasDataset. We could really make it such that if you have
pandas_dataset + other_dataset => merged_pandas_dataset
I understand this is not commutative, e g
other_dataset + pandas_dataset does NOT give a PandasDataset
But this is sometime the case in programming, and i don't think it is a super deal breaker to me. But If it is so to you, then fine with keeping the decorator
Actually I am not against it, simply the current implementation of the method for pandas datasets doesn't allow for other types... it might be useful to have it, but before implementing it we should agree on a (fixed?) logic to cast collections of sample to pd.DataFrame (in particular how to set all the "metadata" associated to a pd.DataFrame like colnames, indices, etc...)
I'd propose to close here the PR and maybe open a dedicated issue to discuss which logic exactly should be implemented to unite PandasDataset and other types of datasets since I think that we will need some deeper thoughts on how to define pd.DataFrame metadata starting from sample collections. What do you think?
yes, what you are saying makes sense
Closes issue #6 but maybe the fact that "other" argument for method "union" must be of the same type as self should be discussed more in dept since there could be a use for merging different kinds of datasets (but still always returning self).