AI4S2S / s2spy

A high-level Python package integrating expert knowledge and artificial intelligence to boost (sub)seasonal forecasting
https://ai4s2s.readthedocs.io/
Apache License 2.0

Add iterator feature to traintest #71

Closed geek-yang closed 2 years ago

geek-yang commented 2 years ago

It is necessary to have an iterator in `traintest.py`, which enables the user to loop through the splits and perform dimensionality reduction (or machine learning) on each train/test group.

geek-yang commented 2 years ago

An outline of this feature/function would look something like this (I think):

def splits_iter(data, dr_func=None):
    # loop through all train/test splits (pseudo code below)
    for split in splits:  # loop through splits
        clustered_data_train = dr_func.fit(train_data)  # get train data from a certain split
        clustered_data_test = dr_func.transform(test_data)  # same for the test data
        # combine data
        data_splits_dr = combine(clustered_data_train, clustered_data_test)  # combine all data into a single data-array
    return data_splits_dr  # note that the returned data has no lat/lon dimensions, only clustered timeseries

# assume the user wants to perform dimensionality reduction on each split
rgdr = RGDR(target_timeseries, eps_km=600, alpha=0.05, min_area_km2=3000**2)
# assume we got our resampled data array `da_splits` with train/test splits using s2spy
da_splits_dr = splits_iter(da_splits, dr_func = rgdr)

The pros are, the user can get all the clustered results for each train/test split in one pack. However, this limits the flexibility for other operations (e.g. ML) which could have different workflows.
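As a concrete illustration of the "one pack" idea, here is a minimal runnable sketch, with sklearn's `KFold` providing the splits and `StandardScaler` standing in (purely hypothetically) for RGDR; it collects the result of every split in a list so that no split's output is lost:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def splits_iter(data, splits, dr_func):
    """Apply an sklearn-style fit/transform reducer to every train/test split.

    Returns one (train_result, test_result) pair per split.
    """
    results = []
    for train_idx, test_idx in splits:
        dr_func.fit(data[train_idx])                    # fit on train data only
        train_out = dr_func.transform(data[train_idx])
        test_out = dr_func.transform(data[test_idx])    # reuse the fitted reducer
        results.append((train_out, test_out))
    return results

data = np.arange(20, dtype=float).reshape(10, 2)     # toy stand-in for resampled data
splits = list(KFold(n_splits=5).split(data))
results = splits_iter(data, splits, StandardScaler())
print(len(results))  # one (train, test) pair per split
```

This is only a sketch of the pattern, not the s2spy API; the real version would carry xarray coordinates through.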

If we opt for flexibility, we could have an enumeration thingy:

for train_split, test_split in traintest.splits_iter(data):
    # user defined operations, e.g. RGDR

But then the user needs to manage the data manually.
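For comparison, that flexible variant can be sketched with plain numpy and an sklearn splitter (both hypothetical stand-ins for the resampled s2spy data and the real split bookkeeping):

```python
import numpy as np
from sklearn.model_selection import KFold

def splits_iter(data, n_splits=3):
    """Yield one (train, test) pair per split; the caller decides what to do with them."""
    for train_idx, test_idx in KFold(n_splits=n_splits).split(data):
        yield data[train_idx], data[test_idx]

data = np.arange(12, dtype=float).reshape(6, 2)
for train_split, test_split in splits_iter(data):
    # user-defined operations, e.g. RGDR, would go here
    print(train_split.shape, test_split.shape)
```

The generator keeps the package code tiny; the cost, as noted above, is that the user manages the per-split results themselves.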

Any thoughts about it @Peter9192 @BSchilperoort @semvijverberg ?

Peter9192 commented 2 years ago

> It is necessary to have an iterator in `traintest.py`, which enables the user to loop through the splits and perform dimensionality reduction (or machine learning) on each train/test group.

I'm not sure I understand this, so far. Why is this necessary? So you can subsequently apply a ML algorithm to each train/test group?

Also, note that the function in your pseudo code only returns the result of the last iteration.

Let's take the discussion offline, shall we?

Peter9192 commented 2 years ago

@geek-yang and I just had a very nice discussion, and we came up with a slightly more elaborate example. Initially, we identify three main use cases for making it easy to iterate over the train/test groups:

  1. Assessing whether the clusters identified by RGDR are good/robust
  2. Performing cross-validation to see if the scores of our ML pipeline are any good and robust
  3. Performing tuning of hyperparameters for our model

For each of these use cases, we need to loop over the train-test groups, but it is a bit of a pain to obtain individual groups from our train-test dataframe. Therefore, a generator could come in really handy. Something like (pseudo code):

def iterate(traintest_splits, data):
    for i in range(n_splits):
        if isinstance(traintest_splits, pd.DataFrame):
            train = data.where(traintest_splits[f"split_{i}"] == "train").dropna()
            test = data.where(traintest_splits[f"split_{i}"] == "test").dropna()
        else:
            # xarray
            train = data.sel(split=i, traintest="train")
            test = data.sel(split=i, traintest="test")
        yield train, test
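A runnable toy version of the DataFrame branch, assuming a hypothetical layout with one `split_{i}` label column per split and the data as a pandas Series on the same index:

```python
import pandas as pd

def iterate(traintest_splits, data, n_splits):
    """Yield (train, test) subsets of `data`, one pair per split column."""
    for i in range(n_splits):
        labels = traintest_splits[f"split_{i}"]  # "train"/"test" label per sample
        yield data[labels == "train"], data[labels == "test"]

splits = pd.DataFrame({
    "split_0": ["train", "train", "test", "test"],
    "split_1": ["test", "test", "train", "train"],
})
data = pd.Series([1.0, 2.0, 3.0, 4.0])
for train, test in iterate(splits, data, n_splits=2):
    print(list(train), list(test))
```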

This could then be used like so:


data = xr.open_dataset(...)
calendar = s2spy.calendar.MonthlyCalendar(...)
calendar.show()
splitter = sklearn.model_selection.KFold(...)
traintest = s2spy.traintest.split_groups(splitter, calendar)
data = s2spy.resample(data, calendar)
RGDR = s2spy.dimensionality.RGDR(...)

### 1. Inspecting whether you get robust clusters
for train, test in iterate(traintest, data):
    # Note: test is not needed in this case
    result = RGDR.fit(train)
    RGDR.plot()
########################################

### 2. Cross-validation use case
RF = sklearn.ensemble.RandomForestRegressor(...)
pipeline = sklearn.pipeline.Pipeline([RGDR, RF])

# Calculate score for each of the test groups
scores = []
for train, test in iterate(traintest, data):
    pipeline.fit(train)
    score = pipeline.score(test)
    scores.append(score)

# See the scores for each train/test group
pd.DataFrame(scores, columns=['score']).plot(kind='bar')
################################
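The cross-validation pattern above maps directly onto plain sklearn objects; here is a self-contained sketch with synthetic data, where `StandardScaler` + `LinearRegression` stand in (hypothetically) for the RGDR + RF pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a clear linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

pipeline = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])

# One score per train/test group, as in the loop above
scores = []
for train_idx, test_idx in KFold(n_splits=4).split(X):
    pipeline.fit(X[train_idx], y[train_idx])
    scores.append(pipeline.score(X[test_idx], y[test_idx]))
print(scores)
```

The per-group scores can then be compared (e.g. in a bar plot) to judge robustness across splits.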

### 3. Tuning hyperparameters (similar to sklearn.model_selection.grid_search)
for parameters in hyper_parameters:
    scores = []  # reset the scores for each set of hyperparameters
    for train, test in iterate(traintest, data):
        pipeline.set_params(**parameters)
        pipeline.fit(train)
        score = pipeline.score(test)
        scores.append(score)

    # For now just plot bar graphs for each set of hyperparameters
    pd.DataFrame(scores, columns=['score']).plot(kind='bar')
##############################################################
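sklearn's `ParameterGrid` can drive the outer loop of this tuning pattern; a minimal runnable sketch, with `Ridge` on synthetic data standing in (hypothetically) for the full pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, ParameterGrid

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=30)

# Grid of hyperparameter sets to try
hyper_parameters = ParameterGrid({"alpha": [0.01, 1.0, 100.0]})

mean_scores = {}
for parameters in hyper_parameters:
    model = Ridge(**parameters)
    scores = []  # reset per hyperparameter set
    for train_idx, test_idx in KFold(n_splits=3).split(X):
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    mean_scores[parameters["alpha"]] = np.mean(scores)
print(mean_scores)
```

This is essentially a hand-rolled `GridSearchCV`; the point of the custom iterator is that the splits come from the calendar-aware train/test groups instead of sklearn's own splitters.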

BSchilperoort commented 2 years ago

Looks good, it seems like this (again) will result in very little code/routines for us to maintain.

Just one small remark on your pseudocode, Peter: you seem to treat the splits and the data itself separately;

def iterate(traintest_splits, data):

while in our current implementation the train/test labels are added to the data. Do we want to keep it this way? I do see that Yang's implementation in AI4S2S/s2spy#74 just has:

def split_iterate(data):

Peter9192 commented 2 years ago

Well spotted! I typed this from the top of my head and already had some doubts about it. As for keeping this: I guess that's up for discussion. I opened AI4S2S/lilio#46; perhaps we can take the discussion there. Something to consider is that, currently, we would have to apply traintest to both features and labels. I'm not sure that is the most elegant approach. Alternatively, we'd have to be able to call the iterator with both labels and features.

semvijverberg commented 2 years ago

Dear all,

I just committed my draft implementation of a traintest splitter, where I tried to address (to some extent) what has been discussed here in AI4S2S/s2spy#71 and AI4S2S/lilio#46. I tried to keep only core functionality, so here's what I did.

Note that I allow passing a list of arguments for X. This is because there is a difference between our pipeline workflow and the pipeline workflow of scikit-learn.

E.g.:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipe.fit(X_train, y_train)

Note that they have X_train as a simple np.ndarray with shape (samples, features). This is something we do not have: our X is generally a number of resampled xr.Datasets.

For us, a realistic pipeline would look like:

RF = sklearn.models.RF(...)
Pipeline([RGDR(y).fit(sst_precursor), RGDR(y).fit(z200_precursor), EOF.fit(OLR_precursor), 'merger_of_features', 'feature_selection', RF])

I hope I'm not going way too fast! But I also feel like we need to take some steps to get ready for the workshop. Also, it is not my intention to get the Pipeline functionality working before the workshop, but I'm just trying to think ahead.

Peter9192 commented 2 years ago

Notice that sklearn splitters internally also implement an iterator, e.g. for the shuffle split class: https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/model_selection/_split.py#L1728

and also the default split methods return iterators: https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/model_selection/_split.py#L1579
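That iterator behaviour is easy to see directly: `split()` on an sklearn splitter is a generator yielding one `(train_indices, test_indices)` pair of index arrays per split:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)
splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

gen = splitter.split(X)
print(type(gen).__name__)  # split() yields lazily rather than returning a list

for train_idx, test_idx in gen:
    print(len(train_idx), len(test_idx))
```

So an s2spy iterator over calendar-based train/test groups would mirror an interface sklearn users already know.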