All methods focus on how to select training examples for the next query round.
All methods assume that we already have an initially labeled training set. (The exception is hierarchical sampling for active learning; the problem with that method is that it entirely gives up the core advantage of active learning.)
Our assumptions:
Class imbalance in the initial training set will hurt active learning performance.
Hierarchical clustering can balance the initial training set.
For new stages, we need to bring in expert knowledge, e.g. run a keyword search through Elasticsearch first to retrieve a more balanced initial training set.
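One way to act on the second assumption above is a sketch like the following: cluster the unlabeled pool hierarchically and take a few points from every cluster as the seed set. The function name `balanced_seed_set` and all parameter choices are our own illustration, not a method from the surveyed papers; the underlying assumption (clusters roughly track classes, so sampling uniformly across clusters yields a more class-balanced seed set than uniform random sampling) is exactly the one stated above and is unverified.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def balanced_seed_set(X, n_clusters=10, per_cluster=2, seed=0):
    """Draw a seed set by taking a few points from every hierarchical cluster.

    Assumption (untested): clusters roughly track classes, so sampling
    uniformly across clusters gives a more class-balanced initial
    training set than uniform random sampling would.
    """
    rng = np.random.default_rng(seed)
    # ward-linkage agglomerative clustering over the unlabeled pool
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, members.size)
        chosen.extend(rng.choice(members, size=take, replace=False).tolist())
    return np.array(sorted(chosen))
```

Even a severely imbalanced pool (e.g. 95 points of one class, 5 of another) would then contribute points from both groups, provided the clustering separates them.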
Negative Results (on the multi-class classification problem)
The entropy-maximization methods make no difference at all compared with random sampling!
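For reference, the selection rule we mean by "entropy maximization" is the standard one: query the unlabeled points whose predicted class distribution has the highest Shannon entropy. The helper name `entropy_query` is ours; a minimal sketch:

```python
import numpy as np

def entropy_query(proba, k):
    """Return indices of the k unlabeled points with highest predictive entropy.

    proba: array of shape (n_samples, n_classes) with predicted class
    probabilities from the current model.
    """
    eps = 1e-12  # guard against log(0)
    ent = -np.sum(proba * np.log(proba + eps), axis=1)
    # highest-entropy (most uncertain) points first
    return np.argsort(ent)[-k:][::-1]

# toy check: the uniform prediction is maximally uncertain
p = np.array([[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]])
print(entropy_query(p, 1))  # -> [1]
```

On our multi-class data, ranking the pool this way selected batches that trained no better than uniformly random batches.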
To Do
Reduce the problem to binary classification, with the minority class as the target.
If that still does not work out, consider keyword search.
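The first to-do item above is a simple relabeling step. A sketch, with our own helper name `to_binary` (the minority class can be found by counting labels):

```python
import numpy as np

def to_binary(y, target_class):
    """Collapse multi-class labels to binary: target (minority) class vs. rest."""
    y = np.asarray(y)
    return (y == target_class).astype(int)

y = np.array([0, 1, 2, 2, 1, 0, 2])
# pick the least frequent class as the target
minority = np.bincount(y).argmin()
print(minority)             # -> 0
print(to_binary(y, minority))  # -> [1 0 0 0 0 1 0]
```

The active learning loop then runs unchanged, but the query strategy only has to separate the minority class from everything else.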
Things done
Tried the entropy-maximization sampling methods on the multi-class problem; as noted under Negative Results, they performed no better than random sampling.