automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

[Suggestion] Purging and embargoing to deal with unintended data leaks in cross validation. #1589

Open cryptocoinserver opened 1 year ago

cryptocoinserver commented 1 year ago

These approaches are often used in financial ML, but they can benefit a wide variety of ML tasks.

In short: adding a safety gap between the k-folds, or between the train, test, and validation splits.

These articles explain it in detail:

https://medium.com/mlearning-ai/why-k-fold-cross-validation-is-failing-in-finance-65c895e83fdf

https://blog.quantinsti.com/cross-validation-embargo-purging-combinatorial/

The Combinatorial Purged Cross Validation mentioned there (it is explained a little better here: https://towardsai.net/p/l/the-combinatorial-purged-cross-validation-method) helps create more walk-forward paths that are purely out-of-sample, for increased statistical significance. It was proposed by Marcos López de Prado in "Advances in Financial Machine Learning".
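For anyone unfamiliar with the idea, here is a minimal plain-Python sketch of purging and embargoing on ordered (time-series) samples. The function name and signature are hypothetical, not from any library; `purge` drops samples immediately before each test fold and `embargo` drops samples immediately after it, following de Prado's terminology:

```python
def purged_kfold_indices(n_samples, n_splits=5, purge=0, embargo=0):
    """Yield (train, test) index lists for ordered data.

    purge:   number of samples dropped from training right before each test fold
    embargo: number of samples dropped from training right after each test fold
    """
    # Distribute samples over folds as evenly as possible (like sklearn's KFold).
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        # Keep only training samples outside the test fold plus its safety gap.
        train = [i for i in range(n_samples)
                 if i < start - purge or i >= stop + embargo]
        yield train, test
        start = stop


for train, test in purged_kfold_indices(10, n_splits=5, purge=1, embargo=1):
    print(test, train)
```

For the second fold (test = [2, 3]) the training set becomes [0, 5, 6, 7, 8, 9]: sample 1 is purged and sample 4 is embargoed, so no training sample sits directly adjacent to the test window.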

eddiebergman commented 1 year ago

Hi @cryptocoinserver,

Thanks for the informative blog posts, very interesting indeed. Unfortunately, we do not primarily target time-series data, and standard k-fold cross-validation works well for the scenarios we do target. However, there have been a few issues about time series before, and you can pass in your own sklearn-style splitting mechanism, as seen in this example:

Further, there is a previous issue related to this here:

cryptocoinserver commented 1 year ago

Nice, thank you for the hint. I will take a look at PredefinedSplit.
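For reference, a small sketch of how scikit-learn's `PredefinedSplit` defines folds (the index array below is made up for illustration); if I understand eddiebergman correctly, a splitter like this can then be handed to auto-sklearn as the `resampling_strategy`:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 means "always in the training set"; other values assign a test-fold index.
test_fold = np.array([-1, -1, 0, 0, 1, 1])
ps = PredefinedSplit(test_fold)

for train_idx, test_idx in ps.split():
    print(train_idx, test_idx)
```

With a layout like this one could hard-code an embargo gap by marking the samples around each test block as -1 only in a copy used for that split, or simply build the folds with a custom splitter class instead.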

BradKML commented 1 year ago

Also, as a side note: can K-Fold be used in this example to cross-check model performance? https://automl.github.io/auto-sklearn/master/examples/20_basic/example_multioutput_regression.html
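Not speaking for the maintainers, but outside auto-sklearn, plain scikit-learn K-Fold does work on multioutput regression: `cross_val_score` accepts a 2-D target. A minimal sketch with a stand-in estimator and synthetic data (not the estimator from the linked example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: 3 features, 2 regression targets, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.01 * rng.normal(size=(100, 2))

# 5-fold CV; the "r2" scorer averages R^2 over the two outputs.
scores = cross_val_score(LinearRegression(), X, Y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="r2")
print(scores)
```

Whether the fitted auto-sklearn object itself can be re-scored this way is a separate question, since refitting the full AutoML search per fold would be expensive.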