egr95 / R-codacore

An R package for learning log-ratio biomarkers from high-throughput sequencing data.

Recommendation for minimum sample size for classification #8

Open · mestaki opened this issue 3 years ago

mestaki commented 3 years ago

Hi folks, I've been playing with CoDaCore for a few different projects now and wanted to give a big shout-out for this awesome tool; its speed in particular is just outrageous. I have a few questions and maybe comments, which I'll post as separate issues (so apologies if you get a few notifications in a row).

My first question: do you folks have a sense of what a minimum sample size would be for codacore to perform reliably? I'm sure it depends a lot on the nature of the data, but generally speaking, is this similar to regular supervised classification? For example, with random forest I usually wouldn't even bother without having at least 40-50 samples. Perhaps more importantly, any recommendations for the minimum size of the training set?

egr95 commented 3 years ago

Hi @mestaki,

Thanks for the kind words, glad to hear you found Codacore useful!

Great question. I agree it can depend on the nature of the data, but as a rule of thumb I would make sure to use strong regularization (lambda = 1 or above) as you edge towards the "small n" regime. This value of lambda ensures that Codacore will only return a balance that beats the "majority vote" baseline by 1 standard error (1SE) in the cross-validation (CV) sense. If you have few samples and the signal is not strong, the CV errors will likely have high variance, which means Codacore will probably return an empty model (no predictive log-ratios found). You could also consider setting the parameter maxBaseLearners = 1, which constrains Codacore to at most 1 predictive log-ratio (i.e., a 1-dimensional model) and further guards against overfitting.
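In code, that conservative setup might look something like the sketch below (a minimal example assuming the `codacore()` interface with the `lambda` and `maxBaseLearners` arguments discussed above; `trainX` and `trainY` are placeholder names for your training data, and you should check `?codacore` for the signature in your installed version):

```r
library(codacore)

# Conservative settings for the "small n" regime:
# lambda = 1 applies the 1SE rule, so a log-ratio is only kept if it
# beats the majority-vote baseline by 1 standard error under CV;
# maxBaseLearners = 1 caps the model at a single predictive log-ratio.
model <- codacore(
  x = trainX,            # samples-by-features matrix of counts/compositions
  y = trainY,            # binary outcome
  lambda = 1,
  maxBaseLearners = 1
)
print(model)             # an empty model means no log-ratio cleared the 1SE bar
```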

Admittedly, we have only experimented with datasets of more than 100 samples, so I can't speak with too much confidence. But ultimately, I'd always keep at least a couple dozen samples in a held-out test set (so a 50/50 split when there are only 40-50 samples), and check that the predictions of the trained model beat simple baselines/established benchmarks. If you do observe a significant improvement, that may be enough to identify relevant biomarkers, or at least to inform follow-up research.
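For instance, a stratified 50/50 split along those lines could be sketched in base R as follows (a hypothetical snippet, where `x` and `y` stand in for your full feature matrix and labels):

```r
set.seed(42)  # for reproducibility

# Draw 50% of each class for training, preserving the class proportions
trainIdx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = floor(length(idx) * 0.5))
}))

trainX <- x[trainIdx, ]; trainY <- y[trainIdx]
testX  <- x[-trainIdx, ]; testY  <- y[-trainIdx]

# Simple baseline to beat: accuracy of always predicting the majority class
baselineAcc <- max(table(testY)) / length(testY)
```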

Thanks!

mestaki commented 3 years ago

Thank you very much @egr95! This is very useful. Just a follow-up on this issue of smaller sample sizes: in your CoDaCoRe tutorial, you use 80% of each group for training, even though the CD group is about twice as large as the control group. In my experience this would be problematic with regular RF if the model is trained on a lot more of one group. Is this not an issue with CoDaCoRe? Or is there a general threshold to look out for? Again with low sample sizes in mind, say the data is 50:200 control to case.

egr95 commented 3 years ago

You're welcome!

You are also right to be concerned about class imbalance. It's really up to the user to decide what's acceptable and whether class rebalancing is preferred. In the case of the Crohn's disease data given in the Guide, the sample size is fairly large (~1000) and the class imbalance (1:2) is not too severe, so I decided to keep things simple rather than overcomplicate the tutorial with a discussion of class imbalance. After all, the main goal of the Guide is to showcase the functionality of the Codacore package itself. However, the example you mention, with a low sample size and a more severe imbalance (1:4), does sound more risky, so my advice would be to consider the usual techniques for imbalanced classification, e.g., precision-recall curves, resampling, etc.
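As a concrete example of the resampling option, here is a minimal sketch of downsampling the majority class to a 1:1 ratio before training (plain base R, not part of the Codacore package; `x` and `y` are again placeholders for your data):

```r
set.seed(7)

# Group sample indices by class label, find the minority class size,
# and draw that many samples from every class
byClass   <- split(seq_along(y), y)
nMinority <- min(lengths(byClass))
keepIdx   <- unlist(lapply(byClass, function(idx) sample(idx, size = nMinority)))

xBalanced <- x[keepIdx, ]
yBalanced <- y[keepIdx]
```

Downsampling throws away data, which hurts when n is already small, so on a 50:200 dataset it may be worth also comparing against the unbalanced fit evaluated with a precision-recall curve rather than raw accuracy.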