azhe825 opened 7 years ago
Thanks for watching the literature.
"Each round, besides all the labeled examples, randomly sample from the unlabeled examples and treat them as negative training examples." i like the idea.
It is simple, but effective.
Important: you should soon stop walking in circles in new land until you document the land you have already visited. You need to get out two more papers, quick smart. High priority.
Sure. Please create a blank ShareLaTeX project for me. I will fill in the rest.
The newest result in e-discovery, "Scalability of Continuous Active Learning for Reliable High-Recall Text Classification", mentions one technique to tackle the problem:
presumptive non-relevant examples, from "Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review".
It may be useful for REUSE.
Testing.
What
Each round, in addition to all the labeled examples, randomly sample from the unlabeled examples and treat them as negative training examples.
Then train the model.
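For concreteness, here is a minimal sketch of one such round in Python. It assumes scikit-learn and a precomputed feature matrix `X`; the function name, the `LinearSVC` choice, and `n_presumed` are illustrative assumptions, not FASTREAD's actual API.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_round(X, labeled_idx, labels, n_presumed=100, rng=None):
    """Keep all labeled examples; presume a random sample of the
    unlabeled examples to be negative; then train the model."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Everything not yet labeled is a candidate presumptive negative.
    unlabeled_idx = np.setdiff1d(np.arange(X.shape[0]), labeled_idx)
    presumed = rng.choice(unlabeled_idx,
                          size=min(n_presumed, len(unlabeled_idx)),
                          replace=False)
    train_idx = np.concatenate([labeled_idx, presumed])
    # Real labels for the labeled pool; negative (0) for the presumed pool.
    y_train = np.concatenate([labels, np.zeros(len(presumed), dtype=int)])
    return LinearSVC().fit(X[train_idx], y_train)
```

The presumed sample would be redrawn each round, so any unlabeled example mistakenly treated as negative this round can still be labeled relevant later.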
Why
E-discovery:
- why we need this technique:
- why it works:
SLR:
- why we need this technique:
- why it works:
Results
FASTREAD, with this technique vs. without:
- Hall:
- Wahono:
- Abdellatif:
At least as good as not using it. (The worst-case result depends on the pseudo-random seed, so it is not reliable.)
Transfer learning results with this technique (a minimal setup sketch follows the list):
- Hall as previous SLR,
  - on Wahono:
  - on Abdellatif:
- Wahono as previous SLR,
  - on Hall:
  - on Abdellatif:
- Abdellatif as previous SLR,
  - on Hall:
  - on Wahono:
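Under the same assumptions as the sketch above (dense numpy feature matrices, a shared vectorizer across corpora, and an illustrative `LinearSVC`), the transfer setup might look like this: all labeled examples from the previous SLR seed the model for the new review, padded with presumptive negatives drawn from the new corpus.

```python
import numpy as np
from sklearn.svm import LinearSVC

def transfer_train(X_prev, y_prev, X_new, n_presumed=100, rng=None):
    """Seed a new review with a previous SLR's labeled data plus
    presumptive negatives from the new, still-unlabeled corpus."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Nothing in the new corpus is labeled yet, so presume a random
    # sample of it to be negative ("non-relevant").
    presumed = rng.choice(X_new.shape[0],
                          size=min(n_presumed, X_new.shape[0]),
                          replace=False)
    X_train = np.vstack([X_prev, X_new[presumed]])
    y_train = np.concatenate([y_prev, np.zeros(len(presumed), dtype=int)])
    return LinearSVC().fit(X_train, y_train)
```

Both corpora must be featurized into one shared space (e.g. the same fitted vectorizer) for the stacked training matrix to make sense.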
Conclusions