Open ecsalomon opened 7 years ago
I'm curious how this would be implemented in something like Triage. Triage isn't sampling at all, is it? It just runs on all the data it is told to.
This is potentially achievable before Triage, by selectively including entities in the cohort that Triage is told about.
We could definitely do something like this internally in Triage, though since there is no sampling currently, there is a whole other layer of communication it would have to do with the user: who made it into the sample, and why?
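To make the "who made it into the sample and why" question concrete, here is a rough sketch of what a sampling step could report back. This is not Triage code: `sample_cohort_with_audit`, its arguments, and the reason strings are all hypothetical, and it assumes entity labels are already available as a plain dict.

```python
import random

def sample_cohort_with_audit(entity_labels, majority_label, keep_fraction, seed=None):
    """Undersample the majority class and record a reason for every entity.

    Hypothetical helper, not part of Triage. `entity_labels` maps
    entity_id -> label. Returns entity_id -> (included, reason).
    """
    rng = random.Random(seed)
    # Sort so that a fixed seed gives the same sample regardless of dict order.
    majority_ids = sorted(e for e, y in entity_labels.items() if y == majority_label)
    n_keep = round(len(majority_ids) * keep_fraction)
    kept = set(rng.sample(majority_ids, n_keep))

    audit = {}
    for entity_id, label in entity_labels.items():
        if label != majority_label:
            audit[entity_id] = (True, "minority class, always kept")
        elif entity_id in kept:
            audit[entity_id] = (True, "majority class, randomly kept")
        else:
            audit[entity_id] = (False, "majority class, randomly dropped")
    return audit
```

Something like this could be stored alongside the experiment so a user can look up any entity and see whether it was sampled, plus the seed needed to reproduce the draw.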
With severely imbalanced classes, people often undersample the more frequent class or oversample the less frequent class (see https://www3.nd.edu/~dial/publications/hoens2013imbalanced.pdf). There are some standardized methods for this that might be good to implement, but even basic random under/oversampling by a given percentage would be good to have.
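For reference, the basic random version could look roughly like the sketch below. It uses only the standard library and works on row indices; the function names and the `keep_fraction` / `extra_fraction` parameters are made up for illustration and are not an existing Triage interface.

```python
import random

def undersample(labels, majority_label, keep_fraction, seed=None):
    """Keep every non-majority row and a random keep_fraction of majority rows.

    Returns the sorted indices of the rows to keep.
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    others = [i for i, y in enumerate(labels) if y != majority_label]
    n_keep = round(len(majority) * keep_fraction)
    # Sample without replacement: each majority row appears at most once.
    return sorted(others + rng.sample(majority, n_keep))

def oversample(labels, minority_label, extra_fraction, seed=None):
    """Keep every row and append extra minority rows drawn with replacement.

    extra_fraction=2.0 adds twice as many duplicates as there are
    minority rows. Returns a list of indices, possibly with repeats.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    n_extra = round(len(minority) * extra_fraction)
    extra = rng.choices(minority, k=n_extra) if minority else []
    return list(range(len(labels))) + extra
```

With 90 negatives and 10 positives, `undersample(labels, 0, 0.2)` would keep 18 negatives plus all 10 positives, and `oversample(labels, 1, 2.0)` would return the original 100 rows plus 20 duplicated positives. Either way, resampling should presumably apply to training data only, so evaluation still reflects the real class balance.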