Aug-04-2016 - Githubissues

azhe825 commented 7 years ago

Baseline Results and possible improvements

In the scenario of

fixed pool
no external knowledge from experts
labels are all correct

Baseline from Biomedical #17

patient_aggressive_undersampling

Baseline from Litigation #16

hasty_continuous_active

Conclusions drawn:

aggressive undersampling is effective
aggressive undersampling needs to be patient (wait for more relevant examples before learning)

Two current winners:

hasty aggressive undersampling: if around 80% recall is enough, it gets there with 5% of the docs reviewed. (Has a stable classifier in the end, but stops learning.)
patient continuous active: achieves 100% recall at the cost of reviewing 20% of the docs. (May be more suitable for handling concept drift, as it keeps learning.)
Ten repeat result

Future work

Get more data sets to experiment on. Ideally one from biomedical and one from litigation.

timm commented 7 years ago

am i reading this right?

you can now improve on standard litigation methods using techniques taken from biomedical?

how does this comment on active learning + text mining in SE? if you did one case study from that domain, we can do an se pub

azhe825 commented 7 years ago

What I want to do for the paper is to borrow methods from biomedical and litigation, combine them into a better one, and apply it to SLR in software engineering. If possible, the combined method should also beat the state of the art in both biomedical and litigation.

I am collecting all the efforts to facilitate Primary Study Selection in SLR, SE (referred to as Citation Screening in systematic review, biomedical engineering, and TAR in e-discovery, litigation).

What I found is that:

In the Software Engineering community, most efforts build tools to manage the entire SLR process. One study uses visual text mining (VTM), an unsupervised method, to reduce the cost of primary study selection. No active learning found. If this is true, it is kind of a blank spot in SLR.
In Biomedical Engineering, systematic reviews started years before the Software Engineering community took them up, and machine learning has been applied to assist citation screening since 2006. In 2010, Byron C. Wallace published his first attempt to apply active learning to citation screening, see #17. His method is the patient_aggressive_undersampling in our figure. He has continued to explore citation screening with crowd-sourcing, multiple experts... But the 2010 method is the most suitable baseline here. Besides Byron C. Wallace, I have found very few studies on the topic (one uses supervised learning, with no comparison against any baseline, and the results are not good).
In e-discovery, the state of the art would be hasty_continuous_active in our figure, from #16. It has not been compared with Byron C. Wallace's work yet.

timm commented 7 years ago

when would i see SE results? something like your graph above but for stack overflow or george's lit review corpus or....
i take it the above graph is for one data set? need more
why does the F1 IQR have that characteristic shape? it gets larger as median increases, then dives again. so IQR returning to near zero would be an early stopping criterion
you've got issues for biomedical and litigation. i take it these are lists of papers that are arguably state of the art in those fields. what does the (currently missing) third issue report, for SE, look like?

azhe825 commented 7 years ago

when would i see SE results? something like your graph above but for stack overflow or george's lit review corpus or....
- This is an SE result. It is Systematic Literature Review data as described in #13.
i take it the above graph is for one data set? need more
- yes. I plan to get one data set from biomedical and one from litigation.
why does the F1 IQR have that characteristic shape? it gets larger as median increases, then dives again. so IQR returning to near zero would be an early stopping criterion
- This is not F1 score. The X axis is the number of documents reviewed; when it hits 1.0, all documents are reviewed (7002 in total). The Y axis is how many relevant documents have been retrieved; 1.0 means 100% (which is 62 relevant documents). This graph is especially suggested by e-discovery people as in #16, and it makes sense to me. In the scenario of citation screening, we do not actually care about the F1 score of the classifier; the only need is to retrieve more relevant documents at less review cost (the literature in both #17 and #16 mentions this).
you've got issues for biomedical and litigation. i take it these are lists of papers that are arguably state of the art in those fields. what does the (currently missing) third issue report, for SE, look like?
- I forgot to open an issue for SE. Will do this soon. Already got the citations from SE in the draft of proposal.
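
The axes described above can be made concrete with a short sketch. It assumes the review order is available as a list of 0/1 relevance labels (1 = relevant) in the order the documents were reviewed. The function name `recall_curve` and the toy labels are illustrative only and are not taken from the experiments in this thread.

```python
def recall_curve(review_order):
    """Return (x, y) points for the plot described above.

    x = fraction of all documents reviewed so far,
    y = fraction of all relevant documents retrieved so far.
    `review_order` is a list of 0/1 labels in review order (an assumption).
    """
    total_docs = len(review_order)
    total_relevant = sum(review_order)
    if total_docs == 0 or total_relevant == 0:
        raise ValueError("need at least one document and one relevant document")
    points = []
    found = 0
    for reviewed, label in enumerate(review_order, start=1):
        found += label
        points.append((reviewed / total_docs, found / total_relevant))
    return points


# Toy example (made-up labels): 10 documents, 2 relevant.
toy_order = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
curve = recall_curve(toy_order)
```

A curve that rises steeply and flattens early means most relevant documents were found at low review cost, which is the goal stated above.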

timm commented 7 years ago

when would i see SE results? something like your graph above but for stack overflow or george's lit review corpus or....
- This is an SE result. It is Systematic Literature Review data as described in #13.

my bad. i get it now. can u calibrate the x-axis for me? how many documents is x=1

i take it the above graph is for one data set? need more
- yes. I plan to get one data set from biomedical and one from litigation.

what about using the hall12 set? all the references marked in [[double bracket]] in "Tracy Hall, Sarah Beecham, David Bowes, David Gray, Steve Counsell: A Systematic Literature Review on Fault Prediction Performance in Software Engineering. IEEE Trans. Software Eng. 38(6): 1276-1304 (2012)"

if you had #13 and hall, that would be a powerful paper

why does the F1 IQR have that characteristic shape? it gets larger as median increases, then dives again. so IQR returning to near zero would be an early stopping criterion
- This is not F1 score. The X axis is the number of documents reviewed; when it hits 1.0, all documents are reviewed (7002 in total). The Y axis is how many relevant documents have been retrieved; 1.0 means 100% (which is 62 relevant documents). This graph is especially suggested by e-discovery people as in #16, and it makes sense to me. In the scenario of citation screening, we do not actually care about the F1 score of the classifier; the only need is to retrieve more relevant documents at less review cost (the literature in both #17 and #16 mentions this).

then is y-axis precision? and do you get what i'm saying about how it could be used as an early stopping criterion?

pause

nope. i'm wrong there. the dotted lines come from 10 repeats. in practice, humans would only do 1 repeat

you've got issues for biomedical and litigation. i take it these are lists of papers that are arguably state of the art in those fields. what does the (currently missing) third issue report, for SE, look like?
- I forgot to open an issue for SE. Will do this soon. Already got the citations from SE in the draft of proposal.

so i'd be really interested if there is anything like a "standard" active learning method in SE. i'm suspecting "no". which leaves the field wide open for your expert input

recommendation: get the hall results then start writing a journal paper for IST. and get all that done before the semester project load gets nasty. i.e. by early to mid sept

timm commented 7 years ago

and can i get a 1-2 line summary of all the sampling methods? eg. patient_aggressive_undersampling

azhe825 commented 7 years ago

3 important components of each method:

patient or hasty: hasty starts learning as soon as we get one relevant example; patient waits for more relevant examples (5 in the graph) and then starts learning. (hasty is suggested by litigation, patient by biomedical)
continuous or not: continuous skips uncertainty sampling, starts with certainty sampling (pick the docs with the highest prediction score) and keeps going forever. Methods without continuous use uncertainty sampling until the model is stable (yellow dot in the first graph, big enough margin in the SVM model), then stop learning. continuous is suggested by litigation.
aggressive_undersampling or not: aggressive_undersampling is a data balancing method that throws away irrelevant training data close to the SVM decision plane. I also tested SMOTE, and in this scenario aggressive_undersampling outperforms SMOTE. aggressive_undersampling is suggested by biomedical.
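
A minimal sketch of the three components, to make the distinctions concrete. It is not the code used for the graphs. It assumes the classifier exposes a signed score per document (for an SVM, the distance to the decision plane, positive meaning "predicted relevant"). All function names are hypothetical. The stopping rule for non-continuous methods is left out, and balancing the classes to equal size is an assumption of this sketch, not something stated above.

```python
def should_start_learning(num_relevant_found, patient, patience=5):
    """hasty: train as soon as one relevant example is known;
    patient: wait until `patience` relevant examples are known."""
    threshold = patience if patient else 1
    return num_relevant_found >= threshold


def pick_next(scores, continuous):
    """Return the index of the next unlabeled document to review.

    continuous -> certainty sampling: highest score.
    otherwise  -> uncertainty sampling: score closest to 0
                  (the "stop when stable" step is not modeled here).
    """
    if continuous:
        return max(range(len(scores)), key=lambda i: scores[i])
    return min(range(len(scores)), key=lambda i: abs(scores[i]))


def aggressive_undersample(labels, scores):
    """Return the indices of the training examples to keep.

    Keeps every relevant example (label 1) and drops the irrelevant
    examples (label 0) closest to the decision plane, i.e. with the
    smallest |score|. Assumption: keep as many irrelevant examples
    as there are relevant ones.
    """
    relevant = [i for i, y in enumerate(labels) if y == 1]
    irrelevant = [i for i, y in enumerate(labels) if y == 0]
    # Sort irrelevant examples so the ones farthest from the plane come first.
    irrelevant.sort(key=lambda i: abs(scores[i]), reverse=True)
    kept = irrelevant[:len(relevant)]
    return sorted(relevant + kept)
```

For example, with scores `[-2.0, 0.1, 1.5]`, certainty sampling picks index 2 and uncertainty sampling picks index 1.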

azhe825 commented 7 years ago

my bad. i get it now. can u calibrate the x-axis for me? how many documents is x=1

7001 docs in total for x-axis. 62 relevant ones in total for y-axis.

what about using the hall12 set? all the references marked in [[double bracket]] in "Tracy Hall, Sarah Beecham, David Bowes, David Gray, Steve Counsell: A Systematic Literature Review on Fault Prediction Performance in Software Engineering. IEEE Trans. Software Eng. 38(6): 1276-1304 (2012)"

Sure, I have started writing the first three sections and will get another data set working as soon as possible.

then is y-axis precision? and do you get what i'm saying about how it could be used as an early stopping criterion?

pause

nope. i'm wrong there. the dotted lines come from 10 repeats. in practice, humans would only do 1 repeat

Actually, early stopping is exactly our goal. We are claiming that with our active learning method, the user can stop after reviewing only 20% of the candidate documents and still have 95% of the relevant ones retrieved.
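
A claim of this form can be checked after the fact with a small helper. The sketch below assumes the full review order is known as a list of 0/1 relevance labels. The name `cost_at_recall` and the toy labels are illustrative, not from our experiments.

```python
def cost_at_recall(review_order, target_recall):
    """Smallest fraction of documents that must be reviewed, in the given
    order, before at least `target_recall` of the relevant ones are found.

    `review_order` is a list of 0/1 labels in review order (an assumption).
    """
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    total_relevant = sum(review_order)
    if total_relevant == 0:
        raise ValueError("no relevant documents in review_order")
    found = 0
    for reviewed, label in enumerate(review_order, start=1):
        found += label
        if found / total_relevant >= target_recall:
            return reviewed / len(review_order)


# Toy example (made-up labels): 10 documents, 3 relevant.
toy_order = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
cost_60 = cost_at_recall(toy_order, 0.6)
cost_100 = cost_at_recall(toy_order, 1.0)
```

This is a retrospective measure: it needs the labels of all documents, so it evaluates a stopping point rather than telling a live reviewer when to stop.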

so i'd be really interested if there is anything like a "standard" active learning method in SE. i'm suspecting "no". which leaves the field wide open for your expert input

As far as I know, no active learning method has been applied to lit review in SE.

timm commented 7 years ago

As far as I know, no active learning method has been applied to lit review in SE.

then you'll be the first

timm commented 7 years ago

ping me in 2 days time. i've overdosed on facebook posts today but i could ask a question on fbook to the se crowd if they know of any

ai-se / ML-assisted-SLR

Aug-04-2016 #18
