Simple random error results

azhe825 commented 6 years ago

Error

Random error: on each labeling task, the human reviewer has an ER=0.05 chance of labeling incorrectly.
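A minimal sketch of this error model, assuming labels are the strings 'yes' and 'no' and that an error simply flips the true label. The function name `noisy_label` is hypothetical and not taken from the project code.

```python
import random


def noisy_label(true_label, error_rate=0.05, rng=random):
    """Return the simulated reviewer label.

    With probability `error_rate` the true label is flipped;
    otherwise it is returned unchanged. (Hypothetical helper.)
    """
    flipped = {"yes": "no", "no": "yes"}
    if rng.random() < error_rate:
        return flipped[true_label]
    return true_label


# Usage: the observed error fraction over many labels should be close to ER.
rng = random.Random(0)
trials = 10_000
wrong = sum(noisy_label("yes", 0.05, rng) != "yes" for _ in range(trials))
print(wrong / trials)
```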

Error Correction:

Run an error check after every CR=50 docs reviewed:

Sort docs(code=='yes') by prediction_probability under the current classifier and pick the bottom 5 for recheck.
Sort docs(code=='no') by prediction_probability under the current classifier and pick the top 5 for recheck.
These two steps find the papers on which the human label and the machine prediction disagree most.
Recheck: ask the reviewer to label the selected docs again, with the same error rate ER=0.05.
If a doc receives the same label as before, it will not be rechecked in the future.
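The selection step above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: each doc is a dict with hypothetical keys 'id', 'code', 'prob' (predicted probability of being relevant) and 'settled'. It also assumes that when a recheck disagrees with the earlier label, the new label replaces it; the thread does not spell that case out.

```python
def pick_for_recheck(docs, k=5):
    """Pick the k 'yes' docs the classifier trusts least and the
    k 'no' docs it finds most likely relevant, skipping settled docs."""
    candidates = [d for d in docs if not d.get("settled", False)]
    coded_yes = sorted((d for d in candidates if d["code"] == "yes"),
                       key=lambda d: d["prob"])                 # lowest first
    coded_no = sorted((d for d in candidates if d["code"] == "no"),
                      key=lambda d: d["prob"], reverse=True)    # highest first
    return coded_yes[:k] + coded_no[:k]


def apply_recheck(doc, new_label):
    """Same label as before: never recheck again.
    Different label: overwrite the code (an assumption, see lead-in)."""
    if new_label == doc["code"]:
        doc["settled"] = True
    else:
        doc["code"] = new_label


# Usage with four toy docs (values are illustrative only).
docs = [
    {"id": 1, "code": "yes", "prob": 0.9},
    {"id": 2, "code": "yes", "prob": 0.1},
    {"id": 3, "code": "no", "prob": 0.8},
    {"id": 4, "code": "no", "prob": 0.2},
]
print([d["id"] for d in pick_for_recheck(docs, k=1)])
```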

Results (one run, with BM25, SEMI):

Hall

Correct Error: {'falseneg': 4, 'falsepos': 11, 'unknownyes': 6, 'truepos': 96} reviewed 410
No correction: {'falseneg': 4, 'falsepos': 111, 'unknownyes': 4, 'truepos': 98} reviewed 2200

Wahono

Correct Error: {'falseneg': 2, 'falsepos': 2, 'unknownyes': 3, 'truepos': 57} reviewed 1410
No correction: {'falseneg': 2, 'falsepos': 134, 'unknownyes': 1, 'truepos': 59} reviewed 2580

Danijel

Correct Error: {'falseneg': 2, 'falsepos': 3, 'unknownyes': 2, 'truepos': 44} reviewed 710
No correction: {'falseneg': 2, 'falsepos': 48, 'unknownyes': 2, 'truepos': 44} reviewed 940

Kitchenham

Correct Error: {'falseneg': 4, 'falsepos': 6, 'unknownyes': 6, 'truepos': 35} reviewed 390
No correction: {'falseneg': 6, 'falsepos': 16, 'unknownyes': 9, 'truepos': 30} reviewed 370
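To compare runs more directly, recall and precision can be derived from each result dict. The sketch below assumes that 'falseneg' counts relevant docs coded 'no', 'unknownyes' counts relevant docs not yet reviewed, and 'falsepos' counts irrelevant docs coded 'yes'. That reading is consistent with the totals above (truepos + falseneg + unknownyes is constant per dataset), but it is an interpretation, and `recall_precision` is a hypothetical helper.

```python
def recall_precision(result):
    """Compute (recall, precision) from a result dict.

    Assumes all relevant docs = truepos + falseneg + unknownyes and
    all docs coded 'yes' = truepos + falsepos (see lead-in).
    """
    relevant = result["truepos"] + result["falseneg"] + result["unknownyes"]
    coded_yes = result["truepos"] + result["falsepos"]
    return result["truepos"] / relevant, result["truepos"] / coded_yes


# Hall, ER=0.05, copied from the results above.
hall_corrected = {'falseneg': 4, 'falsepos': 11, 'unknownyes': 6, 'truepos': 96}
hall_uncorrected = {'falseneg': 4, 'falsepos': 111, 'unknownyes': 4, 'truepos': 98}

for name, result in [("Correct Error", hall_corrected),
                     ("No correction", hall_uncorrected)]:
    recall, precision = recall_precision(result)
    print(f"{name}: recall={recall:.3f} precision={precision:.3f}")
```

Under these assumptions, error correction on Hall keeps recall roughly unchanged (about 0.91 versus 0.92) while precision rises from about 0.47 to about 0.90, with 410 docs reviewed instead of 2200.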

What happens if the error rate is increased to ER=0.1?

Hall

Correct Error: {'count': 583, 'truepos': 98, 'falseneg': 4, 'unknownyes': 4, 'falsepos': 8, 'unique': 430}
No correction: {'falseneg': 15, 'falsepos': 353, 'unknownyes': 2, 'truepos': 89} reviewed 3460

Wahono

Correct Error: {'count': 1112, 'truepos': 40, 'falseneg': 13, 'unknownyes': 9, 'falsepos': 12, 'unique': 860}
No correction: {'falseneg': 4, 'falsepos': 371, 'unknownyes': 0, 'truepos': 58} reviewed 3740

Danijel

Correct Error: {'count': 1189, 'truepos': 41, 'falseneg': 3, 'unknownyes': 4, 'falsepos': 19, 'unique': 910}
No correction: {'falseneg': 4, 'falsepos': 211, 'unknownyes': 2, 'truepos': 42} reviewed 2100

Kitchenham

Correct Error: {'count': 590, 'truepos': 31, 'falseneg': 3, 'unknownyes': 11, 'falsepos': 9, 'unique': 470}
No correction: {'falseneg': 6, 'falsepos': 51, 'unknownyes': 13, 'truepos': 26} reviewed 450

azhe825 commented 6 years ago

Baseline

[Image: screenshot, 2017-10-02 9:43:39 am]

Results when ER=0.1

Hall {'count': 1777, 'truepos': 97, 'falseneg': 5, 'unknownyes': 4, 'falsepos': 14, 'unique': 820}

Wahono {'count': 3583, 'truepos': 59, 'falseneg': 0, 'unknownyes': 3, 'falsepos': 36, 'unique': 1650}

Danijel {'count': 2389, 'truepos': 43, 'falseneg': 2, 'unknownyes': 3, 'falsepos': 35, 'unique': 1100}

Kitchenham {'count': 900, 'truepos': 33, 'falseneg': 1, 'unknownyes': 11, 'falsepos': 6, 'unique': 410}

azhe825 commented 6 years ago

truepos / cost

	No Correction	Three Reviewers	Human-machine Disagreements	No Error
Hall	89/3460	97/1777	98/583	102/490
Wahono	58/3740	59/3583	59/2388	59/1165
Danijel	42/2100	43/2389	41/1189	45/760
Kitchenham	26/450	33/900	31/590	38/460

azhe825 commented 6 years ago

30 repeats

medians reported
truepos / cost / falseneg / falsepos

	none	three	machine	No Error
Hall	94 / 3100 / 10 / 281	99 / 1631 / 3 / 16	95 / 683 / 6 / 14	102 / 490 / 0 / 0
Wahono	54 / 3440 / 7 / 341	57 / 3643 / 2 / 44	55 / 1691 / 4 / 20	59 / 1165 / 0 / 0
Danijel	41 / 2570 / 4 / 241	44 / 2049 / 1 / 26	41 / 1060 / 4 / 14	45 / 760 / 0 / 0
K_all3	31 / 460 / 3 / 44	34 / 957 / 1 / 11	30 / 578 / 5 / 11	38 / 460 / 0 / 0
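For reference, a sketch of how medians over repeated runs could be aggregated. It assumes each repeat yields a dict like the ones earlier in this thread, that 'cost' corresponds to the 'count' key, and that each metric's median is taken independently. The run values below are made up for illustration and are not results from these experiments.

```python
from statistics import median


def median_results(runs, keys=("truepos", "count", "falseneg", "falsepos")):
    """Take the per-metric median across a list of result dicts."""
    return {key: median(run[key] for run in runs) for key in keys}


# Illustrative values only (three fake repeats instead of 30).
runs = [
    {"truepos": 10, "count": 500, "falseneg": 2, "falsepos": 7},
    {"truepos": 12, "count": 450, "falseneg": 0, "falsepos": 3},
    {"truepos": 11, "count": 620, "falseneg": 1, "falsepos": 5},
]
summary = median_results(runs)
print(" / ".join(str(summary[k]) for k in ("truepos", "count", "falseneg", "falsepos")))
```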

Problem with machine:

False negatives are too high.

azhe825 commented 6 years ago

New results (trying to decrease false negatives from machine):

medians reported
truepos / cost / falseneg / falsepos
ER=0.05

	none	three	machine	machine2	machine3	No Error
Hall	98 / 1385 / 5 / 63	101 / 1128 / 1 / 4	99 / 635 / 2 / 3	100 / 725 / 1 / 3	99 / 679 / 2 / 1	102 / 490 / 0 / 0
Wahono	57 / 1880 / 3 / 88	59 / 2913 / 0 / 9	58 / 1554 / 1 / 4	58 / 1651 / 1 / 6	58 / 1510 / 1 / 3	59 / 1165 / 0 / 0
Danijel	43 / 1060 / 2 / 50	45 / 1755 / 0 / 5	43 / 983 / 2 / 2	44 / 1071 / 1 / 4	43 / 976 / 2 / 2	45 / 760 / 0 / 0
K_all3	33 / 430 / 1 / 20	35 / 997 / 0 / 2	34 / 606 / 3 / 1	34 / 588 / 1 / 3	34 / 602 / 2 / 1	38 / 460 / 0 / 0

ER=0.1

	none	three	machine	machine2	machine3	No Error
Hall	93 / 3290 / 11 / 311	99 / 1608 / 3 / 18	96 / 645 / 5 / 11	98 / 823 / 3 / 16	95 / 776 / 6 / 6	102 / 490 / 0 / 0
Wahono	54 / 3455 / 6 / 331	58 / 3744 / 1 / 42	56 / 1696 / 4 / 16	56 / 2161 / 3 / 32	55 / 1694 / 4 / 17	59 / 1165 / 0 / 0
Danijel	41 / 2705 / 5 / 261	44 / 2183 / 1 / 28	41 / 1136 / 3 / 11	42 / 1248 / 2 / 17	41 / 1171 / 3 / 12	45 / 760 / 0 / 0
K_all3	32 / 485 / 4 / 45	34 / 961 / 1 / 10	29 / 593 / 5 / 6	31 / 636 / 3 / 9	31 / 651 / 5 / 6	38 / 460 / 0 / 0

Machine: any paper can be evaluated at most 3 times.
Machine2: each paper previously coded as 'relevant' and currently selected as suspicious is sent back to the reviewer until the maximum of 3 reviews is reached (to decrease false negatives).
Machine3: each paper coded as 'relevant' is immediately sent back to the reviewers until the maximum of 3 reviews is reached, so that no suspicious paper is picked from the 'relevant' side and false negatives are expected to decrease.

timm commented 6 years ago

Please repeat for err = 1, 2, 4, 8.

For the numbers we got from Manny (on the whiteboard), what is the overall cost?

azhe825 commented 6 years ago

I emailed Manny for more detailed error rate information. From what was on the board, I assume there is a huge difference between:

ErrorRateA = wrongly labeling a "relevant" doc as "irrelevant", which leads to low recall
ErrorRateB = wrongly labeling an "irrelevant" doc as "relevant", which leads to low precision

ErrorRateA might be much larger than ErrorRateB.
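If the two rates do differ, the simulated reviewer would need a separate rate per true class. A minimal sketch, assuming 'yes'/'no' string labels; `noisy_label_asymmetric`, `rate_a` and `rate_b` are hypothetical names, and the rates in the usage lines are placeholders rather than Manny's numbers.

```python
import random


def noisy_label_asymmetric(true_label, rate_a, rate_b, rng=random):
    """Simulate a reviewer with class-dependent error rates.

    rate_a: chance of labeling a relevant ('yes') doc as 'no'  (hurts recall)
    rate_b: chance of labeling an irrelevant ('no') doc as 'yes' (hurts precision)
    """
    if true_label == "yes":
        return "no" if rng.random() < rate_a else "yes"
    return "yes" if rng.random() < rate_b else "no"


# Usage with placeholder rates: relevant docs are missed far more often
# than irrelevant docs are wrongly included.
rng = random.Random(1)
trials = 10_000
missed = sum(noisy_label_asymmetric("yes", 0.2, 0.01, rng) == "no"
             for _ in range(trials))
included = sum(noisy_label_asymmetric("no", 0.2, 0.01, rng) == "yes"
               for _ in range(trials))
print(missed / trials, included / trials)
```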

ai-se / ML-assisted-SLR