Stratified splitting to train/test in split_training

pvanni commented 4 years ago

Hello,

As I understood based on the songbird article and looking at the python scripts:

The function split_training (songbird/util.py) takes a set of random samples for the test set. Usually in machine learning analyses you take a stratified split, where the proportion of classes stay the same in both train and test to combat highly unbalanced data. There has also been reports of negative bias in the performance evaluation to test set when the splits are uneven in a cross-validation setting (couldn't find the article from my computer in few minutes, sorry).

In my field (clinical studies) this is especially important as the data is usually very unbalanced with high number of control samples. Most likely I cannot run songbird as is without setting some samples manually to test/train in the metadata. As a machine learning oriented researcher; manually setting train and test samples is a no go. I would like to have a mechanism that stratifies the train/test automatically. I could always code a script that does this in my own metadata, but can cause a lot problems with users unfamiliar with this pitfall.

I think you can fix this easily with few lines of code and perhaps implement it as default or add a split-strategy parameter?

-Petri

mortonjt commented 4 years ago

Hi @pvanni , we'd prefer to avoid incorporating these types of features in songbird for 2 reasons.

Many of these features are already implemented in sklearn. I want to avoid reduplicating code while having minimal dependencies for the sake of installation.
The number of possible model splitting strategies is much greater than provided in sklearn, since we'd have to account for factorial experimental designs (i.e. split according to gender, case/control simultaneously, ...). Properly handling this would be quite formidable and not easily exposable to the command line.

Note that the user can already employ stratified splitting - they would just need to read in the sample metadata as a pandas dataframe and employ one of the sklearn methods. That should be ~3 lines of code. We're open to pull-requests to add a section in the wiki to do explain how to do this.

pvanni commented 4 years ago

Thanks for responding.

Your reasons seem logical.

-Petri

biocore / songbird

Stratified splitting to train/test in split_training #121