Similarity between data and target

azhe825 commented 7 years ago

Supported by https://github.com/ai-se/ML-assisted-SLR/blob/master/no_ES/src/runner.py

Data Similarity

LDA with 30 topics (the number of topics does not matter much). Topic weighting for the two data sets:

L1 similarity, the LDA default:

Hall2007- vs Hall2007+: 0.95
Hall vs Wahono: 0.79

L2 similarity, which makes more sense:

Hall2007- vs Hall2007+: 0.99
Hall vs Wahono: 0.86
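A minimal sketch of how two topic-weight vectors could be compared. The exact definitions used in runner.py are not shown in this thread, so both are assumptions: "L2 similarity" is read here as cosine similarity (the dot product of L2-normalized vectors), and "L1 similarity" as the overlap of the two L1-normalized vectors (1 minus half their L1 distance). The weights below are hypothetical, not the Hall or Wahono values.

```python
import math

def l1_normalize(weights):
    # Scale so the absolute values sum to 1 (a topic distribution).
    total = sum(abs(w) for w in weights)
    return [w / total for w in weights]

def l2_normalize(weights):
    # Scale so the Euclidean length is 1.
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

def l2_similarity(a, b):
    # Assumed reading: cosine similarity of the two weight vectors.
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def l1_similarity(a, b):
    # Assumed reading: 1 minus half the L1 distance between the
    # L1-normalized vectors (the overlap of the two distributions).
    pa, pb = l1_normalize(a), l1_normalize(b)
    return 1.0 - 0.5 * sum(abs(x - y) for x, y in zip(pa, pb))

# Hypothetical topic weights for two data sets (4 topics for brevity).
weights_a = [0.40, 0.30, 0.20, 0.10]
weights_b = [0.10, 0.20, 0.30, 0.40]

print("L1:", l1_similarity(weights_a, weights_b))
print("L2:", l2_similarity(weights_a, weights_b))
```

Both measures return 1 for identical weightings. Under these definitions the input scale does not matter, since each function normalizes its arguments first.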

Target Similarity

LDA with 30 topics. Topic weighting for the two relevant sets:

L1 similarity, the LDA default:

Hall2007- vs Hall2007+: 0.95
Hall vs Wahono: 0.93

L2 similarity, which makes more sense:

Hall2007- vs Hall2007+: 1.00
Hall vs Wahono: 0.96

Conclusion:

The targets of Hall and Wahono are very similar, which explains why UPDATE works.
The data similarity of Hall and Wahono is not that high, but this does not hurt UPDATE performance much.

Problem:

Target similarity is measured by comparing relevant docs; however, this information is not available before the review.

timm commented 7 years ago

can u generate some way of building data sets at increasing distance? see how your conclusions fail as you increase distance?

can u use LDA as a faster way to find relevant topics?
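On the first question, one possible way to get "increasing distance" is sketched here at the level of topic-weight vectors only: blend a base weighting with a different one using a mixing factor `alpha`, then check how similarity to the base drops. This is an illustration under assumptions, not something from runner.py. The vectors, the `mix` helper, and the use of cosine similarity are all hypothetical choices; a real experiment would still need documents sampled from each blended weighting.

```python
import math

def cosine(a, b):
    # Cosine similarity between two topic-weight vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mix(base, other, alpha):
    # Blend two weightings: alpha=0 gives base, alpha=1 gives other.
    return [(1 - alpha) * x + alpha * y for x, y in zip(base, other)]

# Hypothetical topic weightings (4 topics for brevity).
base = [0.55, 0.25, 0.15, 0.05]
other = [0.05, 0.15, 0.25, 0.55]

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
similarities = [cosine(base, mix(base, other, a)) for a in alphas]

for a, s in zip(alphas, similarities):
    print(f"alpha={a:.2f} similarity={s:.3f}")
```

Similarity to the base falls as `alpha` grows, so each `alpha` gives a target weighting at a controlled distance. Running UPDATE against data generated at each level would show where the conclusions stop holding.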

azhe825 commented 7 years ago

Will try generating synthetic data. Preparing for midterm this week.

What do you mean by "use LDA as a faster way to find relevant topics"? Apply LDA+SVM on FASTREAD? I have a preliminary result showing that LDA+SVM with 100 topics performs a bit better than FASTREAD in one run. So it might be promising, as the target is clearly one specific topic.

timm commented 7 years ago

> Preparing for midterm this week.

roger. focus on that
