Hierarchical Clustering does NOT help to balance initial training data.
Possible Reasons:
1. In our data, under our feature set and distance metric, examples of the same class do not cluster together. (This would mean the data, the features, or the distance metric is wrong. I don't quite believe this.)
2. The target class is very rare, so sampling without any prior knowledge or intent is unlikely to find examples of it. (This would mean that completely blind sampling, even with Hierarchical Clustering, is the wrong approach.)
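To make the second reason concrete, here is a rough sketch of how unlikely blind sampling is to hit a rare class. It assumes independent uniform draws (a reasonable approximation when the pool is much larger than the sample); the 1% prevalence is a hypothetical figure, not a number from our data.

```python
import math

def prob_at_least_one_positive(prevalence: float, n_samples: int) -> float:
    """Chance that n independent uniform draws contain at least one target-class example."""
    return 1.0 - (1.0 - prevalence) ** n_samples

def samples_needed(prevalence: float, confidence: float) -> int:
    """Smallest n for which the chance of seeing at least one positive reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - prevalence))

# Hypothetical prevalence of 1% (an assumption for illustration only).
print(round(prob_at_least_one_positive(0.01, 100), 3))  # 0.634
print(samples_needed(0.01, 0.95))                       # 299
```

Under these assumptions, even 100 blind labels leave roughly a one-in-three chance of seeing no target-class example at all, which fits the suspicion that the sampling strategy, rather than the clustering, is the problem.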
Fix: (fix 2 first, if results are good, then 1 is not the problem)
Introduce domain knowledge (ask experts to do this).
Experts can use search and filtering to do this task better and obtain a better-balanced training set.
Hierarchical Clustering can still be applied, but only to assist the filtering and searching process. The expert can also decide which axis to split on (is this word useful for distinguishing the target class?).
Search is exploration; Uncertainty Sampling is exploitation. Let the expert decide flexibly how to balance the two (the expert can start a search at any time during active learning).
Hierarchical Clustering can also help with visualizing the results.
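The search versus Uncertainty Sampling trade-off in the list above could be sketched as follows. This is a minimal illustration, not the DiscoveryIQ implementation: the function names, the whitespace tokenization, and the toy documents and probabilities are all assumptions.

```python
from typing import Dict, List, Optional

def keyword_search(docs: Dict[int, str], query: str) -> List[int]:
    """Exploration: ids of documents that contain every query term (case-insensitive)."""
    terms = set(query.lower().split())
    return [i for i, text in docs.items() if terms <= set(text.lower().split())]

def uncertainty_sample(probs: Dict[int, float], k: int) -> List[int]:
    """Exploitation: the k ids whose predicted P(target class) is closest to 0.5."""
    return sorted(probs, key=lambda i: (abs(probs[i] - 0.5), i))[:k]

def next_batch(docs: Dict[int, str], probs: Dict[int, float], k: int,
               query: Optional[str] = None) -> List[int]:
    """The expert explores by passing a query; otherwise fall back to uncertainty sampling."""
    if query:
        return keyword_search(docs, query)[:k]
    return uncertainty_sample(probs, k)

# Toy data (hypothetical): document texts and the current model's predictions.
docs = {0: "memory leak in parser", 1: "security flaw in login", 2: "ui color change"}
probs = {0: 0.9, 1: 0.55, 2: 0.2}

print(next_batch(docs, probs, k=2))                    # [1, 2]
print(next_batch(docs, probs, k=2, query="security"))  # [1]
```

Keeping the query optional on every call is one way to let the expert switch between the two modes at any point during active learning.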
Strategy
Basically, try the things listed under Fix on the LN DiscoveryIQ project.
Then map the useful techniques onto Systematic Literature Review in SE.
Systematic Literature Review
Our Task:
Similar to the LN task: how to help reviewers quickly retrieve relevant papers through search and active learning.
Optionally, run hierarchical clustering first to guide step 1.
1. Start with a search (filtering).
2. Review the ranked results and label the top N as relevant or not. Going back to a search is possible at any time.
3. Once there are enough labeled examples, or enough newly labeled examples, start training.
4. Show the user the re-ranked results (along with important features and examples, and give the user a handle to change them).
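The workflow above can be sketched end to end in a few lines. This is only a rough illustration: the smoothed per-word log-odds model is a stand-in for whatever learner we end up using, and the corpus, labels, and `MIN_LABELS` threshold are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def search(docs: Dict[int, str], query: str) -> List[int]:
    """Search step: ids of documents that contain every query term."""
    terms = set(query.lower().split())
    return [i for i, text in docs.items() if terms <= set(text.lower().split())]

def train(labeled: Dict[int, bool], docs: Dict[int, str]) -> Dict[str, float]:
    """Training step: per-word log-odds of relevance with add-one smoothing."""
    pos, neg = Counter(), Counter()
    for i, relevant in labeled.items():
        (pos if relevant else neg).update(set(docs[i].lower().split()))
    n_pos = sum(labeled.values())
    n_neg = len(labeled) - n_pos
    return {w: math.log((pos[w] + 1) / (n_pos + 2)) - math.log((neg[w] + 1) / (n_neg + 2))
            for w in set(pos) | set(neg)}

def rerank(weights: Dict[str, float], docs: Dict[int, str],
           labeled: Dict[int, bool]) -> List[int]:
    """Re-ranking step: unlabeled ids, highest summed word weight first."""
    def score(i: int) -> float:
        return sum(weights.get(w, 0.0) for w in set(docs[i].lower().split()))
    return sorted((i for i in docs if i not in labeled), key=lambda i: (-score(i), i))

def top_features(weights: Dict[str, float], n: int) -> List[str]:
    """Words that most strongly indicate relevance, to show alongside the ranking."""
    return sorted(weights, key=lambda w: (-weights[w], w))[:n]

# Toy corpus (hypothetical titles, not real search results).
docs = {
    0: "systematic literature review tools",
    1: "active learning for literature review",
    2: "compiler optimization passes",
    3: "literature screening with active learning",
    4: "gpu kernel optimization",
}
MIN_LABELS = 3  # hypothetical threshold for "enough labeled examples"

# Search, then label the top N. The labels here stand in for the reviewer's judgment.
labeled = {i: True for i in search(docs, "literature review")[:2]}
labeled[2] = False  # an irrelevant paper the reviewer rejected

if len(labeled) >= MIN_LABELS:
    weights = train(labeled, docs)
    print(rerank(weights, docs, labeled))  # [3, 4]
    print(top_features(weights, 2))        # ['literature', 'review']
```

Exposing `weights` as a plain dictionary is one possible "handle" for the user: a reviewer could zero out or boost a word before the next re-ranking.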
Negative result
Hierarchical Clustering does NOT help to balance initial training data.
Strategy
Basically, try the things listed under Fix on the LN DiscoveryIQ project. Then map the useful techniques onto Systematic Literature Review in SE.
Systematic Literature Review
Our Task: Similar to the LN task: how to help reviewers quickly retrieve relevant papers through search and active learning.
Checked several 2016 papers that conduct a Systematic Literature Review; some use CiteSeerX as one of their sources.
Souza, Draylson M., Katia R. Felizardo, and Ellen F. Barbosa. "A Systematic Literature Review of Assessment Tools for Programming Assignments." In 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET), pp. 147-156. IEEE, 2016.
No learning involved, just searching and filtering.
Marshall, Christopher, Pearl Brereton, and Barbara Kitchenham. "Tools to support systematic reviews in software engineering: a cross-domain survey using semi-structured interviews." In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, p. 26. ACM, 2015.
Evaluates tools used in Systematic Literature Review.
Literature on Systematic Literature Review itself (instead of conducting one):
Zhou, You, He Zhang, Xin Huang, Song Yang, Muhammad Ali Babar, and Hao Tang. "Quality assessment of systematic reviews in software engineering: a tertiary study." In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, p. 14. ACM, 2015.
Focuses on how to manage tasks distributed across several reviewers, how to set up standard rubrics, how to do quality assessment...
Details