Closed: cblyton-byte closed this issue 2 weeks ago
Looks like your code creates splits based on the number of rows (trying to keep a balanced number of rows in each split) - I'm assuming this is desired behavior for your use case.
Another possible way to generate the splits is by even slices of time, regardless of how many records fall into those intervals. I can see some use cases would want to split that way, too.
For this application, each row is a 15 min average, i.e. the timestamps are evenly spaced. Hence even time splits and even row splits would be equivalent in this case, and either would work for our application.
In general (nice to have, but not required for our application) it would be good to support even time splits for unevenly spaced data.
Scikit learn includes a useful time-series split for cross validation. Importantly, this split does not shuffle data and ensures test sets follow training sets in time. See here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
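For reference, a minimal usage example of the linked scikit-learn class, showing the no-shuffle, train-precedes-test behavior (assumes `scikit-learn` and `numpy` are installed):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(6, 1)  # 6 evenly spaced samples
tscv = TimeSeriesSplit(n_splits=2)
for train_idx, test_idx in tscv.split(X):
    # each test fold strictly follows its training rows in time
    print(train_idx, test_idx)
# fold 1: train [0 1],       test [2 3]
# fold 2: train [0 1 2 3],   test [4 5]
```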
A similar feature is not available in Spark ML. I have written an implementation here (may not be very efficient):
```python
def time_series_split(output, n_splits):
    # creates a new column in the Spark DataFrame 'output' called 'row_num'
    ...
```
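The body of the implementation is elided above, so here is a minimal pure-Python sketch of the expanding-window logic such a function would compute per `row_num` (the name `time_series_split_indices` is hypothetical; it mirrors scikit-learn's `TimeSeriesSplit` fold sizing):

```python
def time_series_split_indices(n_rows, n_splits):
    # Equal-size test folds; each fold's training set is all rows that
    # come strictly before it in time (no shuffling), as in sklearn's
    # TimeSeriesSplit.
    test_size = n_rows // (n_splits + 1)
    for start in range(n_rows - n_splits * test_size, n_rows, test_size):
        train = list(range(start))                     # rows before the fold
        test = list(range(start, start + test_size))   # the fold itself
        yield train, test
```

In Spark, one way to apply this (an assumption, not the author's exact code) is to assign `row_num` with `pyspark.sql.functions.row_number()` over a `Window` ordered by the timestamp column, then `filter` on `row_num` to materialize each train/test pair.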
Such a feature in Spark ML would be very useful.