Jun-08-2016 - Githubissues

Hierarchical Clustering does NOT help to balance initial training data.

Possible Reasons:

1. In our data, under our feature set and distance metric, examples of the same class do not cluster together. (This would mean the data, the features, or the distance metric is wrong. I don't quite believe this.)
1. The target class is very rare, so sampling without prior knowledge or intent rarely finds examples of it. (This would mean completely blind sampling is the wrong approach, even with Hierarchical Clustering.)
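Both reasons can be checked cheaply before changing the pipeline. The sketch below is only an illustration: `target_rate_per_cluster` and `prob_at_least_one` are hypothetical helper names, the cluster ids are assumed to come from whatever clustering was already run, and the prevalence figure is a made-up example rather than a measurement from our data.

```python
from collections import defaultdict


def target_rate_per_cluster(cluster_ids, labels):
    """Reason 1 check: fraction of target-class examples inside each cluster.

    If same-class examples cluster together, at least one cluster should
    show a rate far above the overall prevalence.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cluster, is_target in zip(cluster_ids, labels):
        totals[cluster] += 1
        hits[cluster] += 1 if is_target else 0
    return {cluster: hits[cluster] / totals[cluster] for cluster in totals}


def prob_at_least_one(prevalence, n_samples):
    """Reason 2 check: chance that blind sampling finds any target example.

    Assumes independent draws (sampling with replacement, or a pool much
    larger than n_samples).
    """
    return 1.0 - (1.0 - prevalence) ** n_samples


# Hypothetical numbers: 1% prevalence, 100 blindly sampled examples.
print(round(prob_at_least_one(0.01, 100), 3))
print(target_rate_per_cluster([0, 0, 1, 1, 1], [1, 1, 0, 0, 0]))
```

With a hypothetical 1% prevalence, 100 blind samples contain at least one target example with probability of about 0.63 and only one target example in expectation, which is far from a balanced training set.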

Fix: (address reason 2 first; if the results are good, then reason 1 is not the problem)

1. Introduce domain knowledge (ask experts to provide it).
1. Experts can use search and filtering to perform this task better and obtain better-balanced training data.
1. Hierarchical Clustering can still be applied, but only to assist the filtering and searching process. The expert can also decide which axis to split on (is this word useful for distinguishing the target class?).
1. Search is exploration; Uncertainty Sampling is exploitation. Let the expert decide flexibly how to balance the two. (The expert can start a search at any time during active learning.)
1. Hierarchical Clustering can also help with visualizing the results.
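The search/uncertainty-sampling trade-off above can be sketched as follows. This is a minimal illustration, not the project's implementation: `keyword_search`, `uncertainty_sample`, and `next_batch` are hypothetical names, search is reduced to plain keyword matching, and `probs` is assumed to hold the current model's predicted probability of the target class for each document.

```python
def keyword_search(docs, terms):
    """Exploration: indices of documents containing any of the expert's terms."""
    wanted = {t.lower() for t in terms}
    return [i for i, doc in enumerate(docs) if wanted & set(doc.lower().split())]


def uncertainty_sample(probs, k):
    """Exploitation: the k documents whose predicted probability is closest to 0.5."""
    by_uncertainty = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return by_uncertainty[:k]


def next_batch(docs, probs, k, expert_query=None):
    """Let the expert choose: a query triggers a search, otherwise sample by uncertainty."""
    if expert_query:
        return keyword_search(docs, expert_query)[:k]
    return uncertainty_sample(probs, k)


docs = ["bug report triage", "release notes", "triage of security bug"]
probs = [0.9, 0.52, 0.1]
print(next_batch(docs, probs, 1))                               # uncertainty sampling
print(next_batch(docs, probs, 2, expert_query=["security"]))    # expert-driven search
```

Passing `expert_query` at any iteration corresponds to the expert starting a search in the middle of active learning.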

Strategy

Basically, try the things listed under Fix on the LN DiscoveryIQ project. Then map the useful techniques onto Systematic Literature Review in SE.

Our Task: Similar to the LN task: how to help reviewers quickly retrieve relevant papers through search and active learning.

Hierarchical clustering can optionally be run first to guide step 1.

1. Start with a search (filtering).
2. Review the ranked results and label the top N as relevant or not. Going back to a search is possible at any time.
3. When there are enough labeled examples, or enough newly labeled examples, start training.
4. Show the user the re-ranked results (along with important features and examples, and give the user a handle to change them).
5. Go back to step 2.
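The loop above can be sketched end to end in a few lines. Everything here is an assumption made for illustration: the function names are hypothetical, the learner is a stand-in (per-term log-odds with add-one smoothing) rather than the model we would actually use, `oracle` plays the role of the human reviewer, and after training the sketch re-ranks all unlabeled documents rather than only the search hits.

```python
import math
from collections import Counter


def tokens(text):
    return set(text.lower().split())


def search(docs, query):
    """Step 1: keyword filter, returning indices of docs sharing a term with the query."""
    terms = tokens(query)
    return [i for i, doc in enumerate(docs) if terms & tokens(doc)]


def train(docs, labels):
    """Step 3: stand-in learner, one smoothed log-odds weight per term."""
    pos, neg = Counter(), Counter()
    for i, relevant in labels.items():
        (pos if relevant else neg).update(tokens(docs[i]))
    n_pos = sum(1 for relevant in labels.values() if relevant)
    n_neg = len(labels) - n_pos
    return {
        t: math.log((pos[t] + 1) / (n_pos + 2)) - math.log((neg[t] + 1) / (n_neg + 2))
        for t in set(pos) | set(neg)
    }


def rank(docs, candidates, weights):
    """Step 4: order candidates by summed term weights, most relevant first."""
    def score(i):
        return sum(weights.get(t, 0.0) for t in tokens(docs[i]))
    return sorted(candidates, key=score, reverse=True)


def review_loop(docs, query, oracle, top_n=2, min_labels=2, rounds=3):
    labels = {}
    ranked = search(docs, query)                          # step 1
    for _ in range(rounds):
        to_review = [i for i in ranked if i not in labels][:top_n]
        if not to_review:
            break
        for i in to_review:                               # step 2
            labels[i] = oracle(i)
        # Train only once both classes have been seen (an assumption of this sketch).
        if len(labels) >= min_labels and len(set(labels.values())) == 2:
            weights = train(docs, labels)                 # step 3
            unlabeled = [i for i in range(len(docs)) if i not in labels]
            ranked = rank(docs, unlabeled, weights)       # step 4
        # step 5: the next iteration returns to step 2
    return labels
```

A real version would let the reviewer issue a new search inside the loop and would expose the learned term weights so the reviewer can edit them, as step 4 describes.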

No learning involved, just searching and filtering.

Evaluate tools used in Systematic Literature Review.

tools

They focus on how to manage tasks distributed across several reviewers, how to set up standard rubrics, how to do quality assessment...

ai-se / ML-assisted-SLR