Result Summary - Githubissues

azhe825 commented 8 years ago

Hall Result

Hall, Tracy, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. "A Systematic Literature Review on Fault Prediction Performance in Software Engineering."

	Hall Paper	IEEE Xplore
Initial Size	2073	8912
Final Size	136	106

Wahono Result

Wahono, Romi Satria. "A systematic literature review of software defect prediction: research trends, datasets, methods and frameworks." Journal of Software Engineering 1, no. 1 (2015): 1-16.

	Wahono Paper	IEEE Xplore
Initial Size	2117	7002
Final Size	71	62

Method Code

Stage 1: Random sampling

P: patient. Randomly sample and review until N=5 relevant studies are retrieved.
H: hasty. Randomly sample and review until 1 relevant study is retrieved.
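
A minimal sketch of Stage 1, assuming the reviewer can be modeled as a function that says whether a document is relevant. The names `random_sampling_stage`, `is_relevant`, and the toy pool are hypothetical and not taken from the project code.

```python
import random

PATIENT_N = 5  # "P": stop after 5 relevant studies
HASTY_N = 1    # "H": stop after 1 relevant study

def random_sampling_stage(doc_ids, is_relevant, n_relevant, seed=0):
    """Review documents in random order until n_relevant relevant ones are found.

    `is_relevant` stands in for the human reviewer (an assumption of this sketch).
    Returns (reviewed, found): every document reviewed, and the relevant subset.
    """
    order = list(doc_ids)
    random.Random(seed).shuffle(order)
    reviewed, found = [], []
    for doc in order:
        if len(found) >= n_relevant:
            break
        reviewed.append(doc)
        if is_relevant(doc):
            found.append(doc)
    return reviewed, found

# Toy pool: 200 documents, of which ids 0..9 are relevant.
relevant_ids = set(range(10))
reviewed_p, found_p = random_sampling_stage(range(200), lambda d: d in relevant_ids, PATIENT_N)
reviewed_h, found_h = random_sampling_stage(range(200), lambda d: d in relevant_ids, HASTY_N)
```

With the same seed, the hasty run reviews a prefix of what the patient run reviews, which is the trade-off between the two: fewer random reviews, but a smaller seed set for the classifier.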

Stage 2: Build classifier

U: uncertainty sampling. Sample the points nearest to the SVM decision hyperplane and add their labels to the training data, until the SVM is stable (margin > X=2.0).
C: certainty sampling. Sample the points that the SVM most confidently predicts to be relevant and add their labels to the training data. (In effect, this method has no Stage 2 and goes directly to Stage 3.)
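
The two sampling rules can be sketched as follows, assuming each unlabeled document already has a signed distance to the SVM hyperplane (positive meaning predicted relevant). The function names and the toy scores are illustrative assumptions, and the margin is taken as a given number rather than computed from a trained model.

```python
def uncertainty_sample(scores, k):
    """Pick the k documents closest to the hyperplane (smallest |score|)."""
    return sorted(scores, key=lambda d: abs(scores[d]))[:k]

def certainty_sample(scores, k):
    """Pick the k documents most confidently predicted relevant (largest score)."""
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:k]

def svm_is_stable(margin, x=2.0):
    """Stage 2 stopping rule from the text: stable once margin > X."""
    return margin > x

# Hypothetical signed distances for five unlabeled documents.
scores = {"d1": 2.3, "d2": -0.1, "d3": 0.4, "d4": -1.8, "d5": 1.1}
uncertain = uncertainty_sample(scores, 2)
confident = certainty_sample(scores, 2)
```

Uncertainty sampling spends reviews where the classifier learns the most; certainty sampling spends them where a relevant paper is most likely to turn up.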

Stage 3: Prediction

S: simple. Use certainty sampling in this stage, but stop training.
C: continuous. Same as C in Stage 2: certainty sampling, and never stop training.
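
To illustrate the difference between S and C, here is a toy sketch in which a simple token-count scorer stands in for the SVM (a deliberate substitution, so that the example runs with the standard library only). `train`, `score`, `prediction_stage`, and the toy documents are all hypothetical.

```python
def train(labeled):
    """Toy stand-in for SVM training: token weight = relevant count - irrelevant count."""
    weights = {}
    for tokens, label in labeled:
        for t in set(tokens):
            weights[t] = weights.get(t, 0) + (1 if label else -1)
    return weights

def score(weights, tokens):
    """Higher score means more confidently predicted relevant."""
    return sum(weights.get(t, 0) for t in set(tokens))

def prediction_stage(pool, labeled, oracle, budget, continuous):
    """Certainty sampling for `budget` reviews; retrain after each review only if continuous."""
    pool, labeled = dict(pool), list(labeled)
    model = train(labeled)
    found = []
    for _ in range(min(budget, len(pool))):
        best = max(pool, key=lambda d: score(model, pool[d]))
        tokens = pool.pop(best)
        label = oracle(best)  # the human review step
        if label:
            found.append(best)
        if continuous:  # "C": keep training; "S": model stays frozen
            labeled.append((tokens, label))
            model = train(labeled)
    return found

pool = {
    "a": ["defect", "prediction"],
    "x": ["cloud", "cost"],
    "y": ["agile", "teams"],
    "b": ["fault", "prediction"],
    "c": ["fault", "model"],
}
seed_labels = [(["defect", "metrics"], True), (["ui", "survey"], False)]
truly_relevant = {"a", "b", "c"}
simple = prediction_stage(pool, seed_labels, truly_relevant.__contains__, 3, continuous=False)
cont = prediction_stage(pool, seed_labels, truly_relevant.__contains__, 3, continuous=True)
```

In this toy pool the frozen model only knows the seed vocabulary, while the continuous variant picks up "prediction" and "fault" from each newly reviewed paper and follows them to the next one.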

Data Balancing

A: aggressive undersampling. Undersample the majority-class training data, keeping only the data points furthest from the SVM decision hyperplane, equal in number to the minority-class training data. Any stage that uses certainty sampling applies aggressive undersampling.
N: no data balancing. Stages that use certainty sampling do not apply any data balancing method.
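
A minimal sketch of aggressive undersampling, assuming the irrelevant documents are the majority class and that "furthest from the hyperplane" means the most negative signed distance. Both are assumptions of this sketch, as are the function name and the toy scores.

```python
def aggressive_undersample(relevant, irrelevant, scores):
    """Keep all relevant (minority) docs and an equal number of irrelevant
    (majority) docs, choosing those furthest on the irrelevant side."""
    ranked = sorted(irrelevant, key=lambda d: scores[d])  # most negative first
    return list(relevant), ranked[:len(relevant)]

relevant = ["r1", "r2"]
irrelevant = ["i1", "i2", "i3", "i4"]
# Hypothetical signed distances to the hyperplane for the irrelevant docs.
scores = {"i1": -0.2, "i2": -2.5, "i3": -1.0, "i4": -3.1}
kept_relevant, kept_irrelevant = aggressive_undersample(relevant, irrelevant, scores)
```

Dropping the irrelevant points near the boundary pushes the retrained hyperplane toward the irrelevant side, so more borderline documents get predicted as relevant.
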

Baselines

Baseline from Medicine #17

P_U_S_A (patient, uncertainty sampling, simple, aggressive undersampling)

Baseline from Litigation #16

H_C_C_N (hasty, certainty sampling, continuous, no data balancing)

Winner so far

H_U_C_A (hasty, uncertainty sampling, continuous, aggressive undersampling). Hasty and continuous come from the litigation baseline; uncertainty sampling and aggressive undersampling come from the medicine baseline.
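
The four-letter method codes can be expanded mechanically. The sketch below is only an illustration of the naming scheme from the Method Code section; `STAGES` and `decode` are hypothetical names.

```python
# One lookup table per position in the code: Stage 1, Stage 2, Stage 3, data balancing.
STAGES = [
    {"P": "patient", "H": "hasty"},
    {"U": "uncertainty sampling", "C": "certainty sampling"},
    {"S": "simple", "C": "continuous"},
    {"A": "aggressive undersampling", "N": "no data balancing"},
]

def decode(code):
    """Expand a method code such as 'H_U_C_A' into its four named choices."""
    letters = code.split("_")
    if len(letters) != len(STAGES):
        raise ValueError(f"expected {len(STAGES)} letters, got {len(letters)}")
    return ", ".join(options[letter] for options, letter in zip(STAGES, letters))

winner = decode("H_U_C_A")
medicine = decode("P_U_S_A")
litigation = decode("H_C_C_N")
```

Note that "C" means different things in positions 2 and 3, which is why the lookup is positional.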

timm commented 8 years ago

please change x-axis to #documents

please copy this to nicholas kraft

azhe825 commented 8 years ago

How do I copy this to Dr. Kraft?

nkraft commented 8 years ago

I have joined the repo now!

timm commented 8 years ago

@nkraft : TL;DR

@azhe825 has:

found the state of the art in legal and biomedical research in using automatic methods to
- find just a few relevant papers then
- quickly find the remaining ones. if you know SVMs, he does a lot of peeking at the margin (distance of closest example to hyperplane).
he's also invented his own method (by carefully recombining parts of the others). turns out, his method works best (see the left-hand curve of the plots above)

then he took two large SE SLRs and asked "how many papers would i have to read to find the papers that those studies found 'relevant'".

with his methods: 400 to 500
without: 7000 to 9000
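
As a rough worked example of what those ranges imply, the snippet below computes the fraction of reading saved. It is not stated which end of one range pairs with which end of the other, so it brackets the saving with the most pessimistic and most optimistic pairings; `effort_saved` is a hypothetical helper.

```python
def effort_saved(with_method, without_method):
    """Fraction of papers that no longer need to be read."""
    return 1 - with_method / without_method

low = effort_saved(500, 7000)   # most pessimistic pairing
high = effort_saved(400, 9000)  # most optimistic pairing
print(f"{low:.1%} to {high:.1%} fewer papers to read")
```

Either way the saving is above 90%, i.e. more than an order of magnitude less reading.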

so i think this is publishable as is but as to next steps....

if we used mechanical turk to identify "relevant" that probably won't work as well as using human experts
but let's say MT is 3 times worse ... but 1000 times cheaper.
we could offer orders of magnitude simplifications in the initial stages of SLRs.

or that's the idea anyway. will it work? well.......

nkraft commented 8 years ago

Very interesting.

First reaction to your next steps: I wonder about expertise vs. cost. MT vs. Ugrads (general population) vs. Ugrads (majors) vs. Ugrads (upper-level majors) vs. Grads vs. Professionals. Where is the sweet spot? And what about sustainability? I could never convince a grad student to help with an SLR a second time. Yet, Tore Dyba cranks them out like a factory line.

When is our meeting scheduled?

nkraft commented 8 years ago

Also, the typical SLR process uses a multi-stage filter: titles then abstracts then papers. Can we model the accuracy vs. cost of each transition? Or does that even matter? Just brainstorming cost model considerations.

azhe825 commented 8 years ago

The meeting is scheduled for tomorrow at 11am in 3231 EB2, NC State.

azhe825 commented 8 years ago

None of our experiments involved actual human review. The "relevant" examples are taken from the final inclusion lists of existing SLR papers, which were reviewed by title and abstract and then by full text. However, our algorithm learns only from the title and abstract and achieves the above performance without the full text.

Our suggested review process is this.

ai-se / ML-assisted-SLR

Result Summary #22
