alex33d / backtest_optimizer

Optimization of trading strategy hyperparameters with combinatorial cross-validation and stress testing
MIT License

Major flaw with a train-test split #3

Open · piotrpomorski opened this issue 3 months ago

piotrpomorski commented 3 months ago

It seems there is a major issue with the way the data is split between train and test, which can be seen below:

[screenshot of debugger output showing train and test index arrays]

Basically, the train indices occur AFTER the test indices, which is unacceptable for a time-series split. The algorithm should make sure the training indices are picked before the testing ones. The snapshot is taken from the `combcv_pl` function inside the `ParameterOptimizer` class (while debugging).
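To make the invariant concrete - this is just an illustrative check, not code from the repo, and the index arrays are hypothetical - a forward-in-time split must satisfy `max(train) < min(test)` in every fold:

```python
import numpy as np

def assert_forward_in_time(train_idx: np.ndarray, test_idx: np.ndarray) -> None:
    """Raise if any training sample falls at or after the start of the test window."""
    if train_idx.max() >= test_idx.min():
        raise ValueError(
            f"Leakage: training data runs up to index {train_idx.max()}, "
            f"but the test window starts at index {test_idx.min()}."
        )

# The situation reported above: the training block comes AFTER the test block.
train_idx = np.arange(100, 200)
test_idx = np.arange(0, 100)
assert_forward_in_time(train_idx, test_idx)  # raises ValueError
```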

alex33d commented 3 months ago

Hi!

"which is unacceptable for the time series split" - why? According to De Prado book it is acceptable as long as you Purge it

piotrpomorski commented 3 months ago

It does not really make sense: how do you want to train the model on the future and test it on the past? That's like overfitting ** 2. Even if it's purged, information from the future leaks into the past, making your model perform better on paper by cheating. It is much like calling `StandardScaler.fit(whole_data)` and then running the modeling on top of it.
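The scaler analogy in code - a minimal sketch with synthetic data, only to show where the leak sits:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
series = rng.normal(100.0, 10.0, size=(500, 1))
train, test = series[:400], series[400:]

# Leaky: mean/std are computed over the full series, so the scaled training
# features already "know" the level and volatility of the test period.
leaky_scaler = StandardScaler().fit(series)

# Correct: fit statistics on the training window only, then apply them forward.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```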

alex33d commented 3 months ago

Suppose you have data points in the range [0,20]. You find optimal params on [0,10] and then test them on [10,20]. Why can't you go the other way round - find optimal params on [10,20] and test them on [0,10]?

piotrpomorski commented 3 months ago

Because [10,20] already knows what happened in [0,10] - financial data already incorporates that information. When you deploy the model live, do you know what will happen in [t+1,t+10]? Any parameter optimisation, model training and testing needs to be done in exactly the same way as when you deploy it live: the data you test on always lies ahead of the data you trained on, never behind it. "Find optimal params on [10,20] and test on [0,10]" is only valid for a cross-sectional dataset, where time does not matter at all.
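For comparison, scikit-learn's `TimeSeriesSplit` enforces exactly this forward-only constraint: in every fold the training indices end before the test indices begin (toy data here, just to print the folds):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # Every fold satisfies train_idx.max() < test_idx.min().
    print(train_idx, test_idx)
```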