Closed pvanni closed 4 years ago
Hi @pvanni , we'd prefer to avoid incorporating these types of features in songbird for two reasons.
Note that users can already perform a stratified split themselves: read the sample metadata into a pandas DataFrame and use one of the sklearn methods. That should be ~3 lines of code. We're open to pull requests adding a section to the wiki explaining how to do this.
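As a hedged sketch of that suggestion: the snippet below reads toy metadata into a pandas DataFrame and uses sklearn's `train_test_split` with its `stratify` argument. The `case_status` column and the `Split` column with `Train`/`Test` values are illustrative assumptions; the exact column name and values songbird expects for its training-column option should be checked against its documentation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy metadata standing in for a real sample-metadata file; the
# 'case_status' column name is an illustrative assumption.
metadata = pd.DataFrame(
    {'case_status': ['control'] * 16 + ['case'] * 4},
    index=[f'S{i}' for i in range(20)])

# Stratified split: class proportions are preserved in both partitions.
train_ids, test_ids = train_test_split(
    metadata.index, test_size=0.25,
    stratify=metadata['case_status'], random_state=42)

# Record the split in a new column; verify against songbird's docs which
# column name/values its training-column option actually expects.
metadata['Split'] = ['Test' if s in set(test_ids) else 'Train'
                     for s in metadata.index]
```

With 16 controls and 4 cases, a 25% stratified test set contains exactly 4 controls and 1 case, so both partitions keep the 4:1 class ratio.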
Thanks for responding.
Your reasons seem logical.
-Petri
Hello,
As I understand it, based on the songbird article and the Python scripts:
The function split_training (songbird/util.py) picks a random set of samples for the test set. In machine learning analyses one usually takes a stratified split, where the class proportions stay the same in both train and test sets, to combat highly unbalanced data. There have also been reports of a negative bias in test-set performance estimates when the splits are uneven in a cross-validation setting (I couldn't find the article in a few minutes, sorry).
In my field (clinical studies) this is especially important, as the data is usually very unbalanced, with a high number of control samples. Most likely I cannot run songbird as is without manually assigning some samples to test/train in the metadata. As a machine-learning-oriented researcher, manually setting train and test samples is a no-go for me. I would like a mechanism that stratifies the train/test split automatically. I could always write a script that does this on my own metadata, but this pitfall can cause a lot of problems for users unfamiliar with it.
I think you could fix this with a few lines of code, and perhaps make it the default or add a split-strategy parameter?
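A per-class sampler along those lines could be sketched as follows. This is not songbird's actual `split_training` from songbird/util.py; the function name `stratified_split`, its signature, and the `status` column are hypothetical, and the real implementation would need adapting to songbird's internals.

```python
import numpy as np
import pandas as pd

def stratified_split(metadata, class_column, test_fraction=0.1, seed=0):
    """Choose test samples per class so class proportions are preserved.

    A sketch of the proposed behaviour, not songbird's split_training;
    the function name and signature are hypothetical.
    """
    rng = np.random.default_rng(seed)
    test_ids = []
    for _, group in metadata.groupby(class_column):
        # Guarantee at least one test sample per class, even rare ones.
        n_test = max(1, round(len(group) * test_fraction))
        test_ids.extend(rng.choice(group.index.to_numpy(), size=n_test,
                                   replace=False))
    test_ids = pd.Index(test_ids)
    train_ids = metadata.index.difference(test_ids)
    return train_ids, test_ids

# Unbalanced toy metadata: 18 controls, 2 cases.
meta = pd.DataFrame({'status': ['control'] * 18 + ['case'] * 2},
                    index=[f'S{i}' for i in range(20)])
train_ids, test_ids = stratified_split(meta, 'status', test_fraction=0.2)
```

Sampling within each class group (rather than over all samples at once) guarantees the rare class appears in the test set, which a purely random split on 18-vs-2 data frequently misses.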
-Petri