Work on #69 has shown that the choice of cross-validation split makes a large difference to the recall measured on the unseen data used to select the best algorithm. There are several ways of doing the split, and it is not yet clear how large the effect is or why it arises. Further exploration is merited, and would be a useful result to discuss in the paper. This would involve taking our classifier of choice (decision tree) and running the hyperparameter tuning script with different split settings, such as (a sketch comparing these follows the list):
StratifiedKFold, no shuffle
StratifiedKFold, random shuffle
KFold, no shuffle
KFold, random shuffle
GroupKFold, based on even split by year and instrument
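
A minimal sketch of how such a comparison could be run, not the project's actual tuning script: it loops the candidate splitters through a GridSearchCV over a DecisionTreeClassifier scored on recall. The synthetic data, the hypothetical year/instrument group labels, and the param_grid are placeholders standing in for the real dataset and search space.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    GridSearchCV, GroupKFold, KFold, StratifiedKFold, train_test_split)
from sklearn.tree import DecisionTreeClassifier

# Placeholder imbalanced data standing in for the real features/labels.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
# Hypothetical groups: one label per (year, instrument) combination.
groups = np.random.RandomState(0).randint(0, 10, size=len(y))

# Hold out unseen data, analogous to the set used to choose the algorithm.
X_train, X_unseen, y_train, y_unseen, g_train, _ = train_test_split(
    X, y, groups, test_size=0.2, random_state=0)

splitters = {
    "StratifiedKFold, no shuffle": StratifiedKFold(n_splits=5),
    "StratifiedKFold, random shuffle": StratifiedKFold(
        n_splits=5, shuffle=True, random_state=0),
    "KFold, no shuffle": KFold(n_splits=5),
    "KFold, random shuffle": KFold(n_splits=5, shuffle=True, random_state=0),
    "GroupKFold, year/instrument groups": GroupKFold(n_splits=5),
}

# Placeholder grid; the real script's search space would go here.
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]}

for name, cv in splitters.items():
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid, scoring="recall", cv=cv)
    # GroupKFold uses the group labels; the other splitters ignore them.
    search.fit(X_train, y_train, groups=g_train)
    print(f"{name}: CV recall={search.best_score_:.3f}, "
          f"unseen recall={search.score(X_unseen, y_unseen):.3f}")
```

Comparing the best cross-validated recall against the recall on the held-out unseen data for each splitter would show directly how much the split strategy moves the numbers used to pick the winning model.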