Open piotrpomorski opened 3 months ago
Hi!
"which is unacceptable for the time series split" - why? According to De Prado's book it is acceptable as long as you purge it
It does not really make sense: how do you want to train the model on the future and test it on the past? That's like overfitting ** 2. Even if it's purged, information from the future leaks into the past, making your model perform better on paper by cheating. It's kind of as if you just did `StandardScaler.fit(whole_data)` and then ran the modeling.
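To make the `StandardScaler.fit(whole_data)` analogy concrete, here is a minimal sketch (with made-up data, stdlib only) showing why fitting scaling statistics on the full sample leaks future information into the training set:

```python
# Sketch of the leakage analogy above: standardising the train set with
# statistics computed over the WHOLE sample (past + future) lets the
# training data "know" about the future. Data here is a toy drifting series.
from statistics import mean, stdev

series = [float(i) for i in range(20)]   # t = 0..19, drifting upwards
train, test = series[:10], series[10:]   # past vs future

# Correct: fit the scaler on the past only.
mu_train, sd_train = mean(train), stdev(train)

# Leaky: fit on past + future, like StandardScaler.fit(whole_data).
mu_full, sd_full = mean(series), stdev(series)

# The leaky mean is pulled up by the future drift, so the scaled training
# data already reflects that the series is going to rise.
print(mu_train, mu_full)   # 4.5 vs 9.5 here
```

The same logic applies to any statistic estimated on the full sample, not just the mean and standard deviation.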
Suppose you have data points in the [0,20] range. You find optimal params on [0,10] and then test them on [10,20]. Why can't you go the other way round: find optimal params on [10,20] and test them on [0,10]?
Because [10,20] already knows what happened in [0,10], and financial data already incorporates that info. When you deploy the model live, do you know what will happen in [t+1,t+10]? Any parameter optimisation, model training and testing needs to be done in exactly the same way as when you are going to deploy it live: you never look back, only in front of you. "Find optimal params on [10,20] and test on [0,10]" is valid for a cross-sectional dataset where time does not matter at all.
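For reference, a walk-forward (expanding-window) split over the [0,20] example above would look like the sketch below. The function name and fold sizes are illustrative, not from the repo; the point is the invariant that every train index strictly precedes every test index:

```python
# Sketch of a walk-forward split: train always precedes test,
# mirroring live deployment where you only ever look forward.
def walk_forward_splits(n, n_folds, test_size):
    """Yield (train_idx, test_idx) pairs with train strictly before test."""
    for k in range(n_folds):
        test_start = n - (n_folds - k) * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

for train_idx, test_idx in walk_forward_splits(n=20, n_folds=3, test_size=3):
    # The invariant this issue asks for: no train index after any test index.
    assert max(train_idx) < min(test_idx)
    print(train_idx[-1], test_idx)
```

`sklearn.model_selection.TimeSeriesSplit` enforces the same ordering, and purging (in the De Prado sense) additionally drops train observations whose labels overlap the test window.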
It seems there is a major issue with the way the data is split between train and test, which can be seen below: basically, the train indices occur AFTER the test indices, which is unacceptable for a time series split. The algorithm should make sure that the training indices are picked before the testing ones. The snapshot is taken from the `combcv_pl` func inside the `ParameterOptimizer` class (while debugging).