Re-initializing train-test split fails

AI4S2S / lilio

Calendar generator for machine learning with timeseries data

https://lilio.readthedocs.io/en/latest/

Apache License 2.0

5 stars 1 forks source link

Re-initializing train-test split fails #45

Open semvijverberg opened 2 years ago

semvijverberg commented 2 years ago

When re-initializing the train-test split, it fails because the dimension 'split' already exists.

This is easily solved by: if 'split' not in data.dims: data.expand_dims("split")

However, it also exposes an important issue. When re-initializing the traintest split, it should give feedback when traintest split is changed. This communication is purely for clarification purposes. The user might not be expecting the traintest to change when using sklearn.model_selection.ShuffleSplit (because the user might assume the seed is fixed), but it will do that. Hence, there will need to be a test, checking whether the splits that were already present in data are identical the ones that were generated.

Peter9192 commented 2 years ago

Related to AI4S2S/lilio#46 and AI4S2S/s2spy#71. I think eventually "reinitializing" the traintest splits is not something you should want to do. You just instantiate a new instance of a splitter class. Regarding the labels, it might be better if the splitter class (which could have a method that gives back labels aligned with the data) doesn't add dimensions to the data, but rather returns a separate data-array (which the user could then choose to append to his data).

geek-yang commented 2 years ago

Just checked the source code, even with the current implementation, it seems the function makes a copy of the data and return it to the user (see code line 57) https://github.com/AI4S2S/s2spy/blob/d228782c7fa9a6af70cec3f58904bdba48708e56/s2spy/traintest.py#L50-L60

So the user is supposed to call this function multiple times if you want to generate train/test splits in different ways, for instance:

from sklearn.model_selection import ShuffleSplit, KFold
splitter = KFold(n_splits=3)
ds1 = s2spy.traintest.split_groups(splitter, ds)

splitter = ShuffleSplit(n_splits=3)
ds2 = s2spy.traintest.split_groups(splitter, ds)

geek-yang commented 2 years ago

But I do notice that for pandas dataframe the changes are made on the input data. We might want to fix this. But given that we would like to make add_labels a feature to be called by the user, we can change it later, I assume.

semvijverberg commented 2 years ago

Sorry this issue might be obsolete soon.

I agree that the dimension 'split' with labels (train, test ,skip) do not have be added by default.

I'm currently working on a new solution related to AI4S2S/lilio#46, where traintest_split is a class and adding labels is optional via add_labels.

Can we pause this discussion until I have a new draft traintest.py?