Data visualization:
2-component PCA of the Hall dataset. Red circles mark relevant examples; blue crosses mark irrelevant examples.
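A plot like the one described can be produced with a sketch such as the following. The feature matrix `X` (e.g. TF-IDF document vectors) and binary label array `y` are hypothetical stand-ins for the Hall dataset features, which are not specified here.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_pca(X, y):
    """Project features onto 2 PCA components and plot both classes.

    X: (n_samples, n_features) array; y: binary labels (1 = relevant).
    Returns the 2-D projection; call plt.show() to display the figure.
    """
    Z = PCA(n_components=2).fit_transform(X)
    rel = np.asarray(y) == 1
    # blue crosses: irrelevant; hollow red circles: relevant
    plt.scatter(Z[~rel, 0], Z[~rel, 1], c="blue", marker="x",
                label="irrelevant")
    plt.scatter(Z[rel, 0], Z[rel, 1], facecolors="none", edgecolors="red",
                marker="o", label="relevant")
    plt.legend()
    return Z
```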
Characteristics:
Chara 1: relevant class examples are far fewer than irrelevant class examples (prevalence 1-5%). This can be verified directly from the label counts.
Chara 2: relevant class examples tend to cluster together (though not tightly), while irrelevant class examples scatter all over the space.
Chara 3: there is no clear boundary between the relevant and irrelevant classes; many irrelevant class examples lie very close to relevant class examples.
What we get from data characteristics:
Chara 2 and Chara 3: it is almost impossible to learn a highly accurate model, but it is possible to learn a high-recall, low-precision model (which is why a human, rather than the machine, makes the final decision).
Chara 1: data balancing is required
Chara 1 and Chara 2: it is better to start training early:
  one relevant example alone can provide useful information to guide the review (Chara 2)
  it is expensive to obtain multiple relevant examples through random sampling (Chara 1)
Chara 2: uncertainty sampling and class weighting are required before a sufficient number of relevant examples has been retrieved (because a model built on few relevant examples is not reliable).
Chara 3: once a sufficient number of relevant examples has been retrieved, aggressive undersampling works well for data balancing, since it also removes noise (irrelevant examples close to the relevant class).
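The strategy above can be sketched as follows. This is a minimal illustration, not the original implementation: the classifier choice (`LinearSVC` with `class_weight="balanced"` as the weighting step), the switch threshold `ENOUGH`, and all function names are assumptions. Aggressive undersampling is realized here by keeping only the irrelevant training examples that score most confidently irrelevant, discarding those near the decision boundary.

```python
import numpy as np
from sklearn.svm import LinearSVC

ENOUGH = 30  # assumed number of relevant examples at which to switch strategy

def train(X_labeled, y_labeled):
    """Fit a weighted linear model; aggressively undersample once enough
    relevant examples (y == 1) are available."""
    n_rel = int(np.sum(y_labeled == 1))
    X_train, y_train = X_labeled, y_labeled
    if n_rel >= ENOUGH:
        # score all labeled examples with a provisional model, then keep
        # only the n_rel irrelevant examples farthest from the relevant
        # class (lowest decision scores) -- this drops the "noise", i.e.
        # irrelevant examples close to the relevant class
        clf = LinearSVC(class_weight="balanced").fit(X_labeled, y_labeled)
        scores = clf.decision_function(X_labeled)
        irr = np.where(y_labeled == 0)[0]
        keep_irr = irr[np.argsort(scores[irr])[:n_rel]]
        keep = np.concatenate([np.where(y_labeled == 1)[0], keep_irr])
        X_train, y_train = X_labeled[keep], y_labeled[keep]
    return LinearSVC(class_weight="balanced").fit(X_train, y_train)

def query(clf, X_pool, n_rel_found):
    """Pick the next example to review: uncertainty sampling early,
    certainty sampling (most likely relevant) once enough are found."""
    scores = clf.decision_function(X_pool)
    if n_rel_found < ENOUGH:
        return int(np.argmin(np.abs(scores)))  # closest to the boundary
    return int(np.argmax(scores))              # most likely relevant
```

In this sketch the reviewer would label the queried example, add it to the labeled set, retrain, and repeat until enough relevant studies have been retrieved.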