Test FASTREAD on Kitchenham data

azhe825 commented 7 years ago

Original Data

(Data is not stored in this repo.)

1547 records in total.

393 are labeled Y (relevant) or N (irrelevant); 1 is labeled dup (duplicate).

Labeled subset

Extract all 393 labeled records: 74 are Y and 319 are N.

Run FASTREAD

- Start with no labels known, using random sampling.
- Query 10 records per iteration to get their true labels.
- Start training as soon as 1 (or 5) Y is found.
- Stop when no Y is retrieved for 3 iterations (not applied in the early stages).
- Repeat 30 times.
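
The procedure above can be sketched as a small simulation. This is a minimal sketch, not the FASTREAD implementation: the function `simulate_review`, its parameters, the one-dimensional synthetic `features`, and the nearest-to-centre ranker are all assumptions made for illustration (the real tool trains a model on the labeled records). It also assumes the stop rule counts consecutive iterations and is only checked once training has started.

```python
import random


def simulate_review(labels, features, batch=10, start_after_pos=1,
                    patience=3, seed=0):
    """Return the indices of the reviewed records, in review order."""
    rng = random.Random(seed)
    unlabeled = set(range(len(labels)))
    reviewed = []      # indices in the order they were reviewed
    empty_rounds = 0   # consecutive trained iterations that found no Y

    while unlabeled:
        pos = [i for i in reviewed if labels[i] == "Y"]
        training = len(pos) >= start_after_pos
        if training:
            # Stand-in ranker (assumption): prefer records whose feature
            # is closest to the mean feature of the Y records found so far.
            centre = sum(features[i] for i in pos) / len(pos)
            ranked = sorted(unlabeled, key=lambda i: abs(features[i] - centre))
            picked = ranked[:batch]
        else:
            # Before training starts, query records at random.
            picked = rng.sample(sorted(unlabeled), min(batch, len(unlabeled)))

        found = sum(labels[i] == "Y" for i in picked)
        reviewed.extend(picked)
        unlabeled.difference_update(picked)

        if training:
            empty_rounds = 0 if found else empty_rounds + 1
            if empty_rounds >= patience:
                break
    return reviewed


# Synthetic stand-in for the labeled subset: 74 Y and 319 N.
data_rng = random.Random(1)
labels = ["Y"] * 74 + ["N"] * 319
features = [data_rng.gauss(1.0, 0.5) if y == "Y" else data_rng.gauss(-1.0, 0.5)
            for y in labels]

order = simulate_review(labels, features)
found = sum(labels[i] == "Y" for i in order)
print(len(order), "reviewed,", found, "of 74 Y found")
```

Calling `simulate_review` with 30 different `seed` values and `start_after_pos=1` or `start_after_pos=5` would mirror the two settings compared here, though the numbers from this toy ranker say nothing about the real results.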

Result:

Blue: 0th percentile (best case); Purple: 25th percentile (Q1); Green: 50th percentile (median); Brown: 75th percentile (Q3); Red: 100th percentile (worst case).

Start early, 1 Y

Start later, 5 Y
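
The five percentiles in the legend can be computed from the repeated runs roughly as follows. This is a sketch only: `five_number_summary` is a hypothetical helper, and the `costs` list is made-up illustration data, not a result from this experiment.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up review costs from 5 repeats, for illustration only.
costs = [150, 170, 180, 200, 260]
summary = five_number_summary(costs)
print(summary)
```

Applying the same helper to the 30 values observed at each review step would give the five curves named in the legend.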

Conclusion

About half of the effort can be saved compared to linear review.
1, 2, or no relevant records are missed.
NEW: When prevalence (relevant/(relevant+irrelevant)) is high, it is better (more stable) to start training later. High prevalence is quite normal when the data set is small.
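
As a worked example of the prevalence formula, using the counts from the labeled subset (74 Y, 319 N); the function name `prevalence` is just an illustrative choice:

```python
def prevalence(relevant, irrelevant):
    """Fraction of records that are relevant."""
    return relevant / (relevant + irrelevant)


p = prevalence(74, 319)
print(f"{p:.3f}")  # 0.188
```

So roughly 19% of the labeled records are relevant.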

NEW schema:

Given that it is better to start later when prevalence is high, we suggest a new schema:

- Start with no labels known, using random sampling.
- Query 10 records per iteration to get their true labels.
- Start training when at least 1 Y is found AND at least 40 studies have been reviewed.
- Stop when no Y is retrieved for 3 iterations (not applied in the early stages).
- Repeat 30 times.
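
The new start condition can be written as a single predicate. This is a minimal sketch; the name `should_start_training` and its parameter names are hypothetical, while the defaults (1 and 40) come from the schema above.

```python
def should_start_training(n_relevant_found, n_reviewed,
                          min_relevant=1, min_reviewed=40):
    """True once both conditions of the new schema are met."""
    return n_relevant_found >= min_relevant and n_reviewed >= min_reviewed


print(should_start_training(1, 10))   # False: a Y was found, but too few reviewed
print(should_start_training(0, 100))  # False: enough reviewed, but no Y yet
print(should_start_training(1, 40))   # True: both conditions hold
```

With 10 records queried per iteration, the second condition means training cannot begin before the fourth iteration has finished.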

Result:

Benefit:

- Low variance when prevalence is high.
- Performance stays unchanged when prevalence is low.

timm commented 7 years ago

this is exactly the same as our other "start again" results, right? one of the reasons to reuse lit reviews is to reduce the variance of future reviews.

why isn't the dotted line a straight line? pause. that's because it's real, right? we don't assume a straight line; we select at random and then report +1 whenever we find a new one?

timm commented 7 years ago

any reason not to share this with b.k.? do you have any other data of hers to process?

azhe825 commented 7 years ago

Yes, the dotted line is from random sampling.

This is the only labeled data I have. Will share this result with Dr. Kitchenham.
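
A small sketch of why a baseline drawn from actual random sampling is jagged rather than a straight line: records are reviewed in a random order and the count goes up by 1 only when a Y turns up. The function `random_review_curve` is a hypothetical illustration, not the code that produced the plots.

```python
import random


def random_review_curve(labels, seed=0):
    """Cumulative number of Y records found when reviewing in random order."""
    rng = random.Random(seed)
    order = list(labels)
    rng.shuffle(order)
    curve, found = [], 0
    for label in order:
        found += label == "Y"   # +1 only when a new relevant record is hit
        curve.append(found)
    return curve


labels = ["Y"] * 74 + ["N"] * 319
curve = random_review_curve(labels)
print(curve[:20])
```

The curve always ends at 74 after all 393 records, but the path there depends on the shuffle, so each seed gives a slightly different line.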
