Data visualization:
2-component PCA of the Hall dataset. Red circles mark relevant examples; blue crosses mark irrelevant examples.
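A plot like the one described can be produced with a sketch such as the following. The feature matrix `X` (e.g. TF-IDF document vectors) and binary label array `y` are hypothetical stand-ins for the Hall dataset features, which are not specified here.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_pca(X, y):
    """Project features onto 2 PCA components and plot both classes.

    X: (n_samples, n_features) array; y: binary labels (1 = relevant).
    Returns the 2-D projection; call plt.show() to display the figure.
    """
    Z = PCA(n_components=2).fit_transform(X)
    rel = np.asarray(y) == 1
    # blue crosses: irrelevant; hollow red circles: relevant
    plt.scatter(Z[~rel, 0], Z[~rel, 1], c="blue", marker="x",
                label="irrelevant")
    plt.scatter(Z[rel, 0], Z[rel, 1], facecolors="none", edgecolors="red",
                marker="o", label="relevant")
    plt.legend()
    return Z
```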
Characteristics:
Chara 1: relevant class examples are far fewer than irrelevant class examples (prevalence 1-5%). This can be verified directly from the label counts.
Chara 2: relevant class examples tend to cluster together (though not tightly), while irrelevant class examples scatter all over the space.
Chara 3: there is no clear boundary between the relevant and irrelevant classes; many irrelevant class examples lie very close to relevant class examples.
What we get from data characteristics:
Chara 2 and Chara 3: it is almost impossible to learn a highly accurate model, but it is possible to learn a high-recall, low-precision model (which is why a human, rather than the machine, makes the final decision).
Chara 1: data balancing is required
Chara 1 and Chara 2: it is better to start training early:
  one relevant example alone can provide useful information to guide the review (Chara 2)
  it is expensive to obtain multiple relevant examples through random sampling (Chara 1)
Chara 2: uncertainty sampling and class weighting are required before a sufficient number of relevant examples has been retrieved (because a model built on few relevant examples is not reliable).
Chara 3: once a sufficient number of relevant examples has been retrieved, aggressive undersampling works well for data balancing, since it also removes noise (irrelevant examples close to the relevant class).
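The strategy above can be sketched as follows. This is a minimal illustration, not the original implementation: the classifier choice (`LinearSVC` with `class_weight="balanced"` as the weighting step), the switch threshold `ENOUGH`, and all function names are assumptions. Aggressive undersampling is realized here by keeping only the irrelevant training examples that score most confidently irrelevant, discarding those near the decision boundary.

```python
import numpy as np
from sklearn.svm import LinearSVC

ENOUGH = 30  # assumed number of relevant examples at which to switch strategy

def train(X_labeled, y_labeled):
    """Fit a weighted linear model; aggressively undersample once enough
    relevant examples (y == 1) are available."""
    n_rel = int(np.sum(y_labeled == 1))
    X_train, y_train = X_labeled, y_labeled
    if n_rel >= ENOUGH:
        # score all labeled examples with a provisional model, then keep
        # only the n_rel irrelevant examples farthest from the relevant
        # class (lowest decision scores) -- this drops the "noise", i.e.
        # irrelevant examples close to the relevant class
        clf = LinearSVC(class_weight="balanced").fit(X_labeled, y_labeled)
        scores = clf.decision_function(X_labeled)
        irr = np.where(y_labeled == 0)[0]
        keep_irr = irr[np.argsort(scores[irr])[:n_rel]]
        keep = np.concatenate([np.where(y_labeled == 1)[0], keep_irr])
        X_train, y_train = X_labeled[keep], y_labeled[keep]
    return LinearSVC(class_weight="balanced").fit(X_train, y_train)

def query(clf, X_pool, n_rel_found):
    """Pick the next example to review: uncertainty sampling early,
    certainty sampling (most likely relevant) once enough are found."""
    scores = clf.decision_function(X_pool)
    if n_rel_found < ENOUGH:
        return int(np.argmin(np.abs(scores)))  # closest to the boundary
    return int(np.argmax(scores))              # most likely relevant
```

In this sketch the reviewer would label the queried example, add it to the labeled set, retrain, and repeat until enough relevant studies have been retrieved.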