[ISSUE-6] Update Datasets union method to operate and return only datatests of the same type

NicolaDonelli / py4ai-core

MIT License

0 stars 0 forks source link

[ISSUE-6] Update Datasets union method to operate and return only datatests of the same type #12

Closed NicolaDonelli closed 2 years ago

NicolaDonelli commented 2 years ago

Closes issue #6 but maybe the fact that "other" argument for method "union" must be of the same type as self should be discussed more in dept since there could be a use for merging different kinds of datasets (but still always returning self).

deusebio commented 2 years ago

I suppose we would like to be able to merge Cached and Lazy Datasets, or?

lazy_dataset.union(cached_dataset)

Should throw a type error?

NicolaDonelli commented 2 years ago

In the current implementation it actually raises a TypeError, thanks to the decorator, but I am not completely sure this is what one could actually want.

I am quite sure that one would want that performing a union from a certain type of Dataset should return a Dataset of the same type, but I am not at all sure that one wouldn’t want to perform a union between different types of datasets.

What do you think about it?

deusebio commented 2 years ago

Yes, I understand that this makes it to raise an error. And my point was that I would rather endorse a proposal to make this possible.

I understand that one could always to

lazy_dataset.toCached.union(cached_dataset)

but this seems a bit too verbose

NicolaDonelli commented 2 years ago

Fair enough. I changed the method removing, from LazyDataset and CachedDataset union method the same_type decorator (but I kept it for PandasDataset since in that case it is actually required by the code that the united dataset is a PandasDataset).

deusebio commented 2 years ago

I'm not sure I understand why this should not apply to PandasDataset. We could really make it such that if you have

pandas_dataset + other_dataset => merged_pandas_dataset

I understand this is not commutative, e g

other_dataset + pandas_dataset does NOT give a PandasDataset

But this is sometime the case in programming, and i don't think it is a super deal breaker to me. But If it is so to you, then fine with keeping the decorator

NicolaDonelli commented 2 years ago

I'm not sure I understand why this should not apply to PandasDataset. We could really make it such that if you have

pandas_dataset + other_dataset => merged_pandas_dataset

I understand this is not commutative, e g

other_dataset + pandas_dataset does NOT give a PandasDataset

But this is sometime the case in programming, and i don't think it is a super deal breaker to me. But If it is so to you, then fine with keeping the decorator

Actually I am not against it, simply the current implementation of the method for pandas datasets doesn't allow for other types... it might be useful to have it, but before implementing it we should agree on a (fixed?) logic to cast collections of sample to pd.DataFrame (in particular how to set all the "metadata" associated to a pd.DataFrame like colnames, indices, etc...)

NicolaDonelli commented 2 years ago

I'd propose to close here the PR and maybe open a dedicated issue to discuss which logic exactly should be implemented to unite PandasDataset and other types of datasets since I think that we will need some deeper thoughts on how to define pd.DataFrame metadata starting from sample collections. What do you think?

deusebio commented 2 years ago

yes, what you are saying makes sense