patient or hasty: hasty starts learning as soon as we get one relevant example; patient waits for more relevant examples (5 in the graph) before it starts learning. (hasty is suggested by litigation, patient by biomedical.)
continuous or not: continuous skips uncertainty sampling, starts with certainty sampling (picking the docs with the highest prediction scores), and keeps learning indefinitely. Methods without continuous use uncertainty sampling until the model is stable (yellow dot in the first graph; a big enough margin in the SVM model), then stop learning. continuous is suggested by litigation.
aggressive_undersampling or not: aggressive_undersampling is a data-balancing method that throws away irrelevant training examples that are close to the SVM decision plane. I also tested SMOTE; in this scenario, aggressive_undersampling outperforms SMOTE. aggressive_undersampling is suggested by biomedical.
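The three switches above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for the experiments: the function names (ready_to_train, keeps_training, pick_next, aggressive_undersample) are hypothetical, the scores are assumed to be signed SVM decision values (positive means predicted relevant), "close to the decision plane" is read as a small absolute score, and the number of irrelevant examples to keep (n_keep) is left to the caller. It also assumes that a non-continuous method, once stable, ranks docs by certainty with its frozen model.

```python
def ready_to_train(n_relevant_found, patient, patient_threshold=5):
    """hasty: train after 1 relevant example; patient: wait for patient_threshold."""
    needed = patient_threshold if patient else 1
    return n_relevant_found >= needed


def keeps_training(continuous, stable):
    """continuous never stops retraining; otherwise stop once the model is stable."""
    return continuous or not stable


def pick_next(scores, continuous, stable):
    """Return the index of the next unlabeled doc to review.

    scores: signed decision values for the unlabeled docs (assumption).
    Certainty sampling picks the highest score; uncertainty sampling
    picks the score closest to zero (nearest the decision plane).
    """
    indices = range(len(scores))
    if continuous or stable:
        # Certainty sampling. Using it after 'stable' is an assumption.
        return max(indices, key=lambda i: scores[i])
    # Uncertainty sampling while the model is still unstable.
    return min(indices, key=lambda i: abs(scores[i]))


def aggressive_undersample(irrelevant_scores, n_keep):
    """Return indices of the irrelevant training examples to keep.

    Drops the examples closest to the decision plane by keeping the
    n_keep examples with the largest absolute decision value.
    """
    by_distance = sorted(
        range(len(irrelevant_scores)),
        key=lambda i: abs(irrelevant_scores[i]),
        reverse=True,
    )
    return sorted(by_distance[:n_keep])


if __name__ == "__main__":
    unlabeled = [0.1, -2.0, 1.5]
    print(pick_next(unlabeled, continuous=True, stable=False))
    print(pick_next(unlabeled, continuous=False, stable=False))
    print(aggressive_undersample([-0.1, -2.0, -0.5, -1.2], n_keep=2))
```

A common choice for n_keep would be the number of relevant training examples, so the two classes end up balanced, but the notes above do not state which value was used.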
Hall Result
Hall, Tracy, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. "A Systematic Literature Review on Fault Prediction Performance in Software Engineering."
In the scenario of
Baseline from Biomedical #17
Baseline from Litigation #16
Different setups: