Currently we only generate probabilities for a voting-based classifier on the full dataset. This does not allow us to evaluate the performance of the voting. For this we need to:
- hold back another test set: with the current folds based on cruise number, the voting classifier will have seen data from all the folds, so we need to hold back some data from the ensemble for testing purposes.
- measure the recall, accuracy, etc. of the max-probability vote and compare it to the individual classifiers.
- check how much disagreement over instrument type there is within the labelled dataset, compared to the unlabelled dataset.
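A minimal sketch of the first two steps, assuming scikit-learn and a per-sample cruise number available as a group key (the data, classifier choices, and column names here are placeholders, not the project's actual pipeline). `GroupShuffleSplit` holds back whole cruises, so the ensemble never sees any sample from a test cruise:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical stand-in data: features X, labels y, and a fake cruise number per sample.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
groups = np.random.RandomState(0).randint(0, 20, size=len(y))

# Hold back entire cruises as a final test set for the ensemble.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=0)),
]
# voting="soft" averages predicted probabilities and takes the max, i.e. a max-prob vote.
voter = VotingClassifier(estimators=estimators, voting="soft")
voter.fit(X[train_idx], y[train_idx])

# Compare the ensemble against each individual classifier on the held-back cruises.
for name, est in estimators:
    est.fit(X[train_idx], y[train_idx])
    pred = est.predict(X[test_idx])
    print(name, accuracy_score(y[test_idx], pred),
          recall_score(y[test_idx], pred, average="macro"))
pred = voter.predict(X[test_idx])
print("vote", accuracy_score(y[test_idx], pred),
      recall_score(y[test_idx], pred, average="macro"))
```

Splitting on the group key (rather than on rows) is what prevents leakage: every sample from a given cruise lands entirely in train or entirely in test.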
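For the disagreement check, one simple metric is the fraction of samples on which the classifiers do not all predict the same instrument type; computing this separately on the labelled and unlabelled sets gives the comparison above. The prediction matrix here is a made-up illustration:

```python
import numpy as np

# Hypothetical instrument-type predictions: one row per sample,
# one column per classifier in the ensemble.
preds = np.array([
    [0, 0, 1],   # two classifiers say 0, one says 1 -> disagreement
    [1, 1, 1],   # unanimous
    [0, 2, 1],   # all three differ
])

# Fraction of samples where not all classifiers agree.
disagreement = np.mean([len(set(row)) > 1 for row in preds])
print(disagreement)  # -> 0.666... (2 of 3 samples show disagreement)
```

Running this on the labelled and unlabelled predictions separately would show whether the ensemble is systematically less certain on unlabelled data.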