ai-se / ML-assisted-SLR

Automated Systematic Literature Review

Characteristics of Lit review data #71

Open azhe825 opened 6 years ago

azhe825 commented 6 years ago

Data visualization:

2-component PCA on the Hall dataset. Red circles are relevant, blue crosses are irrelevant.
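For anyone who wants to reproduce this kind of plot, here is a minimal sketch of a 2-component PCA projection using only NumPy. The data below is synthetic and only stands in for the Hall dataset; the feature matrix `X`, the label vector `y`, the class sizes, and the helper name `pca_2d` are all assumptions for illustration, not the code used to make the figure.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center every feature
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Synthetic stand-in (NOT the Hall dataset): 3% prevalence, 20 features.
rng = np.random.default_rng(0)
X_irrelevant = rng.normal(0.0, 1.0, size=(970, 20))
X_relevant = rng.normal(0.8, 0.6, size=(30, 20))
X = np.vstack([X_irrelevant, X_relevant])
y = np.array([0] * 970 + [1] * 30)  # 1 = relevant, 0 = irrelevant

Z = pca_2d(X)
print(Z.shape)
```

Plotting `Z[y == 1]` as red circles and `Z[y == 0]` as blue crosses with any plotting library should give a picture in the same style as the one above.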

Characteristics:

Chara 1: relevant class examples are far fewer than irrelevant class examples (prevalence 1-5%). We can prove this.

Chara 2: relevant class examples tend to group together (though not tightly), while irrelevant class examples scatter all over the space.

Chara 3: there is no clear boundary between the relevant and irrelevant classes; many irrelevant class examples are very close to relevant class examples.
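One possible way to put numbers on Chara 1 and Chara 3, sketched under assumptions: prevalence is simply the fraction of relevant labels, and "no clear boundary" can be roughly probed by checking how many relevant examples have an irrelevant example as their nearest neighbor. The function names and the toy data are hypothetical, and the nearest-neighbor check is only one proxy among many.

```python
import numpy as np

def prevalence(y):
    """Fraction of examples labeled relevant (1)."""
    return float(np.mean(np.asarray(y) == 1))

def frac_relevant_with_irrelevant_nn(X, y):
    """Fraction of relevant examples whose nearest neighbor is irrelevant.

    Builds the full pairwise distance matrix, so memory is O(n^2).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    sq = (X ** 2).sum(axis=1)
    # Squared Euclidean distances between all pairs of rows.
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(D, np.inf)  # an example is not its own neighbor
    nearest = D.argmin(axis=1)
    relevant = y == 1
    return float(np.mean(y[nearest[relevant]] == 0))

# Toy 1-D data, for illustration only.
X_toy = np.array([[0.0], [1.0], [1.2], [5.0], [5.1], [9.0]])
y_toy = np.array([0, 1, 0, 1, 1, 0])

p = prevalence(y_toy)
f = frac_relevant_with_irrelevant_nn(X_toy, y_toy)
print(p, f)
```

In the toy data, one of the three relevant points sits right next to an irrelevant one, which is the situation Chara 3 describes.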

What we get from data characteristics:

  1. Chara 2 and Chara 3: It is almost impossible to learn an accurate model, but it is possible to learn a high-recall, low-precision model (which is why a human, not the machine, makes the final decision)
  2. Chara 1: data balancing is required
  3. Chara 1 and Chara 2: better to start training early.
    • one relevant example alone can provide useful information to guide the review (Chara 2)
    • expensive to get multiple relevant examples through random sampling (Chara 1)
  4. Chara 2: uncertainty sampling and weighting are required until a sufficient number of relevant examples have been retrieved (because a model built on few relevant examples is not reliable)
  5. Chara 3: aggressive undersampling works well for data balancing once a sufficient number of relevant examples are available, since it also removes noise (irrelevant examples close to the relevant class)
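
Points 4 and 5 can be sketched roughly as follows. This is a minimal illustration, assuming a preliminary model has already produced a score per example where higher means "looks more relevant" and 0 is the decision threshold. The function names, the threshold, and the toy scores are hypothetical, and this is one reading of aggressive undersampling (drop the irrelevant examples that score closest to the relevant class), not necessarily the exact procedure used in this repo.

```python
import numpy as np

def aggressive_undersample(scores, y):
    """Return indices of a balanced training set.

    Keeps every relevant example and only the irrelevant examples with the
    lowest scores, i.e. the ones that look least like the relevant class.
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(pos), len(neg))
    order = np.argsort(scores[neg], kind="stable")  # ascending score
    return np.sort(np.concatenate([pos, neg[order[:n_keep]]]))

def most_uncertain(scores, k, threshold=0.0):
    """Indices of the k examples whose scores are closest to the threshold."""
    scores = np.asarray(scores, dtype=float)
    return np.argsort(np.abs(scores - threshold), kind="stable")[:k]

# Toy scores and labels, for illustration only.
scores = [-2.0, -0.1, 0.3, 1.5, -1.2, 0.9, -0.4, 0.05]
y = [0, 0, 0, 1, 0, 1, 0, 0]

balanced = aggressive_undersample(scores, y)
query = most_uncertain(scores, k=2)
print(balanced, query)
```

Here the irrelevant examples scoring 0.3, 0.05, -0.1 and -0.4 are dropped from training as likely noise, while uncertainty sampling would ask the reviewer about the two examples scoring nearest to the threshold.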
azhe825 commented 6 years ago

Useful? How to better structure it? Where to use it?